Springer Series in Statistics 

Advisors: 

P. Diggle, S. Fienberg, K. Krickeberg, 
I. Olkin, N. Wermuth 



Springer 



Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Andrews/Herzberg: Data: A Collection of Problems from Many Fields for the Student and Research Worker.
Anscombe: Computing in Statistical Science through APL.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Bolfarine/Zacks: Prediction Theory for Finite Populations.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brémaud: Point Processes and Queues: Martingale Dynamics.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Daley/Vere-Jones: An Introduction to the Theory of Point Processes.
Dzhaparidze: Parameter Estimation and Hypothesis Testing in Spectral Analysis of Stationary Time Series.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear Models.
Farrell: Multivariate Calculation.
Federer: Statistical Design and Analysis for Intercropping Experiments.
Fienberg/Hoaglin/Kruskal/Tanur (Eds.): A Statistical Model: Frederick Mosteller's Contributions to Statistics, Science and Public Policy.
Fisher/Sen: The Collected Works of Wassily Hoeffding.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses.
Goodman/Kruskal: Measures of Association for Cross Classifications.
Grandell: Aspects of Risk Theory.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Hartigan: Bayes Theory.
Heyer: Theory of Statistical Experiments.
Huet/Bouvier/Gruet/Jolivet: Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS Examples.
Jolliffe: Principal Component Analysis.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kres: Statistical Tables for Multivariate Analysis.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts.
Longford: Models for Uncertainty in Educational Testing.
Manoukian: Modern Concepts and Theorems of Mathematical Statistics.
Miller, Jr.: Simultaneous Statistical Inference, 2nd edition.
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of The Federalist Papers.






Mark J. Schervish 



Theory of Statistics 



With 26 Illustrations 



Springer 




Mark J. Schervish 
Department of Statistics 
Carnegie Mellon University 
Pittsburgh, PA 15213 
USA 



Library of Congress Cataloging-in-Publication Data
Schervish, Mark J.
Theory of Statistics / Mark J. Schervish
p. cm. - (Springer series in statistics)
Includes bibliographical references (p. - ) and index.
ISBN-13: 978-1-4612-8708-7
1. Mathematical statistics. I. Title. II. Series.
QA276.S346 1995
519.5-dc20    95-11235



Printed on acid-free paper.

© 1995 Springer-Verlag New York, Inc.
Softcover reprint of the hardcover 1st edition 1995 

All rights reserved. This work may not be translated or copied in whole or in 
part without the written permission of the publisher (Springer-Verlag New York,
Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in 
connection with reviews or scholarly analysis. Use in connection with any form of 
information storage and retrieval, electronic adaptation, computer software, or by 
similar or dissimilar methodology now known or hereafter developed is forbidden. 
The use of general descriptive names, trade names, trademarks, etc., in this pub- 
lication, even if the former are not especially identified, is not to be taken as a 
sign that such names, as understood by the Trade Marks and Merchandise Marks 
Act, may accordingly be used freely by anyone. 

Production managed by Laura Carlson; manufacturing supervised by Joe Quatela. 
Photocomposed pages prepared from the author's LaTeX files.
Printed and bound by Edwards Brothers, Inc., Ann Arbor, MI.
Printed in the United States of America. 



98765432 (Corrected second printing, 1997) 



ISBN-13: 978-1-4612-8708-7    e-ISBN-13: 978-1-4612-4250-5
DOI: 10.1007/978-1-4612-4250-5 



To Nancy, Margaret, and Meredith 



Preface 



This text has grown out of notes used for lectures in a course entitled Ad- 
vanced Statistical Theory at Carnegie Mellon University over several years. 
The course (when taught by the author) has attempted to cover, in one 
academic year, those topics in estimation, testing, and large sample theory 
that are commonly taught to second year graduate students in a math- 
ematically rigorous fashion. Most texts at this level fall into one of two 
categories. They either ignore the Bayesian point of view altogether or 
they cover Bayesian topics almost exclusively. This book covers topics in 
both classical 1 and Bayesian inference in a great deal of generality. My own 
point of view is Bayesian, but I believe that students need to learn both 
types of theory in order to achieve a fuller appreciation of the subject mat- 
ter. Although many comparisons are made between classical and Bayesian 
methods, it is not a goal of the text to present a formal comparison of the 
two approaches as was done by Barnett (1982). Rather, the goal has been 
to prepare Ph.D. students to be able to understand and contribute to the 
literature of theoretical statistics with a broader perspective than would be 
achieved from a purely Bayesian or a purely classical course. 

After a brief review of elementary statistical theory, the coverage of the 
subject matter begins with a detailed treatment of parametric statistical 
models as motivated by DeFinetti's representation theorem for exchangeable 
random variables (Chapter 1). In addition, Dirichlet processes and other 
tailfree processes are presented as examples of infinite-dimensional param- 
eters. Chapter 2 introduces sufficient statistics from both Bayesian and 
non-Bayesian viewpoints. Exponential families are discussed here because 
of the important role sufficiency plays in these models. Also, the concept 
of information is introduced together with its relationship to sufficiency. 
A representation theorem is given for general distributions based on suffi- 
cient statistics. Decision theory is the subject of Chapter 3, which includes 
discussions of admissibility and minimaxity. Section 3.3 presents an ax- 
iomatic derivation of Bayesian decision theory, including the use of condi- 
tional probability. Chapter 4 covers hypothesis testing, including unbiased 
tests, P-values, and Bayes factors. We highlight the contrasts between the 
traditional "uniformly most powerful" (UMP) approach to testing and de- 
cision theoretic approaches (both Bayesian and classical). In particular, we 



1 What I call classical inference is called frequentist inference by some other
authors.



see how the asymmetric treatment of hypotheses and alternatives in the 
UMP approach accounts for much of the difference. Point and set estima- 
tion are the topics of Chapter 5. This includes unbiased and maximum like- 
lihood estimation as well as confidence, prediction, and tolerance sets. We 
also introduce robust estimation and the bootstrap. Equivariant decision 
rules are covered in Chapter 6. In Section 6.2.2, we debunk the common 
misconception of equivariant rules as means for preserving decisions un- 
der changes of measurement scale. Large sample theory is the subject of 
Chapter 7. This includes asymptotic properties of sample quantiles, maximum
likelihood estimators, robust estimators, and posterior distributions.
The last two chapters cover situations in which the random variables are 
not modeled as being exchangeable. Hierarchical models (Chapter 8) are 
useful for data arrays. Here, the parameters of the model can be modeled 
as exchangeable while the observables are only partially exchangeable. We 
introduce the popular computational tool known as Markov chain Monte 
Carlo, Gibbs sampling, or successive substitution sampling, which is very 
useful for fitting hierarchical models. Some topics in sequential analysis are 
presented in Chapter 9. These include classical tests, Bayesian decisions, 
confidence sets, and the issue of sampling to a foregone conclusion. 

The presentation of material is intended to be very general and very pre- 
cise. One of the goals of this book was to be the place where the proofs could 
be found for many of those theorems whose proofs were "beyond the scope 
of the course" in elementary or intermediate courses. For this reason, it is 
useful to rely on measure theoretic probability. Since many students have 
not studied measure theory and probability recently or at all, I have in- 
cluded appendices on measure theory (Appendix A) and probability theory 
(Appendix B). 2 Even those who have measure theory in their background 
can benefit from seeing these topics discussed briefly and working through 
some problems. At the beginnings of these two appendices, I have given 
overviews of the important definitions and results. These should serve as 
reminders for those who already know the material and as groundbreaking 
for those who do not. There are, however, some topics covered in Ap- 
pendix B that are not part of traditional probability courses. In particular, 
there is the material in Section B.3.3 on conditional densities with respect 
to nonproduct measures. Also, there is Section B.6, which attempts to use 
the ideas of gambling to motivate the mathematical definition of proba- 
bility. Since conditional independence and the law of total probability are 
so central to Bayesian predictive inference, readers may want to study the 
material in Sections B.3.4 and B.3.5 also. 

Appendix C lists purely mathematical theorems that are used in the text 



2 These two appendices contain sufficient detail to serve as the basis for a full- 
semester (or more) course in measure and probability. They are included in this 
book to make it more self-contained for students who do not have a background 
in measure theory. 






without proof, and Appendix D gives a brief summary of the distributions 
that are used throughout the text. An index is provided for notation and 
abbreviations that are used at a considerable distance from where they are 
defined. Throughout the book, I have added footnotes to those results that 
are of interest mainly through their value in proving other results. These 
footnotes indicate where the results are used explicitly elsewhere in the 
book. This is intended as an aid to instructors who wish to select which 
results to prove in detail and which to mention only in passing. A single 
numbering system is used within each chapter and includes theorems, lem- 
mas, definitions, corollaries, propositions, assumptions, examples, tables, 
figures, and equations in order to make them easier to locate when needed. 

I was reluctant to mark sections to indicate which ones could be skipped 
without interrupting the flow of the text because I was afraid that readers 
would interpret such markings as signs that the material was not impor- 
tant. However, because there may be too much material to cover, especially 
if the measure theory and probability appendices are covered, I have de- 
cided to mark two different kinds of sections whose material is used at most 
sparingly in other parts of the text. Those sections marked with a plus sign 
(+) make use of the theory of martingales. A lot of the material in some 
of these sections is used in other such sections, but the remainder of the 
text is relatively free of martingales. Martingales are particularly useful in 
proving limit theorems for conditional probabilities. The remaining sections 
that can be skipped or covered out of order without seriously interrupting 
the flow of material are marked with an asterisk (*). No such system is 
foolproof, however. For example, even though essentially all of the material 
dealing with equivariance is isolated in Chapter 6, there is one example in 
Chapter 7 and one exercise that make reference to the material. Similarly, 
the material from other sections marked with the asterisk may occasion- 
ally appear in examples later in the text. But these occurrences should be 
inconsequential. Of course, any instructor who feels that equivariance is an 
important topic should not be put off by the asterisk. In that same vein, 
students really ought to be made aware of what the main theorems in Sec- 
tion 3.3 say (Theorems 3.108 and 3.110), even though the section could be 
skipped without interrupting the flow of the material. 

I would like to thank many people who helped me to write this book or 
who read early drafts. Many people have provided corrections and guidance 
for clarifying some of the discussions (not to mention corrections to some 
proofs). In particular, thanks are due to Chris Andrews, Bogdan Doytchi- 
nov, Petros Hadjicostas, Tao Jiang, Rob Kass, Agostino Nobile, Shingo 
Oue, and Thomas Short. Morris DeGroot helped me to understand what 
is really going on with equivariance. Teddy Seidenfeld introduced me to 
the axiomatic foundations of decision theory. Mel Novick introduced me 
to the writings of DeFinetti. Persi Diaconis and Bill Strawderman made 
valuable suggestions after reading drafts of the book, and those suggestions 
are incorporated here. Special thanks go to Larry Wasserman, who taught 






from two early drafts of the text and provided invaluable feedback on the 
(lack of) clarity in various sections. 

As a student at the University of Illinois at Urbana-Champaign, I learned 
statistical theory from Stephen Portnoy, Robert Wijsman, and Robert 
Bohrer (although some of these people may deny that fact after reading this 
book). Many of the proofs and results in this text bear startling resemblance 
to my notes taken as a student. Many, in turn, undoubtedly resemble works 
recorded in other places. Whenever I have essentially lifted, or cosmetically 
modified, or even only been deeply inspired by a published source, I have 
cited that source in the text. If results copied from my notes as a student 
or produced independently also resemble published results, I can only apol- 
ogize for not having taken enough time to seek out the earliest published 
reference for every result and proof in the text. Similarly, the problems at 
the ends of each chapter have come from many sources. One source used 
often was the file of old qualifying exams from the Department of Statistics 
at Carnegie Mellon University. These problems, in turn, came from various 
sources unknown to me (even the ones I wrote). If I have used a problem 
without giving proper credit, please take it as a compliment. Some of the 
more challenging problems have been identified with an asterisk (*) after 
the problem number. Many of the plots in the text were produced using 
The New S Language and S-Plus [see Becker, Chambers, and Wilks (1988) 
and StatSci (1992)]. The original text processing was done using LaTeX,
which was written by Lamport (1986) and was based on TeX by Knuth
(1984).

Pittsburgh, Pennsylvania Mark J. Schervish 

May 31, 1995 

Several corrections needed to be made between the first and second print- 
ings of this book. During that time, I created a world-wide web page 

http://www.stat.cmu.edu/~mark/advt/

on which readers may find up-to-date lists of any corrections that have 
been required. The most significant individual corrections made between 
the first and second printings are listed here: 

• The discussion of the famous M-estimator on page 314 has been 
corrected. 

• Theorems 7.108 and 7.116 each needed an additional condition concerning
uniform boundedness of the derivatives of the $H_n$ and $H_n^*$
functions on a compact set. Only small changes were made to the
proofs.

• The proofs of Theorems B.83 and B.133 were corrected, and small 
changes were made to Example 2.81 and Definition B.137. 



Contents 



Preface vii 

Chapter 1: Probability Models 1 

1.1 Background 1 

1.1.1 General Concepts 1 

1.1.2 Classical Statistics 2 

1.1.3 Bayesian Statistics 4 

1.2 Exchangeability 5 

1.2.1 Distributional Symmetry 5 

1.2.2 Frequency and Exchangeability 10 

1.3 Parametric Models 12 

1.3.1 Prior, Posterior, and Predictive Distributions 13 

1.3.2 Improper Prior Distributions 19 

1.3.3 Choosing Probability Distributions 21 

1.4 DeFinetti's Representation Theorem 24 

1.4.1 Understanding the Theorems 24 

1.4.2 The Mathematical Statements 26 

1.4.3 Some Examples 28 

1.5 Proofs of DeFinetti's Theorem and Related Results* 33 

1.5.1 Strong Law of Large Numbers 33 

1.5.2 The Bernoulli Case 36 

1.5.3 The General Finite Case* 38 

1.5.4 The General Infinite Case 45 

1.5.5 Formal Introduction to Parametric Models* 49 

1.6 Infinite-Dimensional Parameters* 52

1.6.1 Dirichlet Processes 52 

1.6.2 Tailfree Processes* 60 

1.7 Problems 73 

Chapter 2: Sufficient Statistics 82 

2.1 Definitions 82 

2.1.1 Notational Overview 82 

2.1.2 Sufficiency 83 

2.1.3 Minimal and Complete Sufficiency 92 

2.1.4 Ancillarity 95 

2.2 Exponential Families of Distributions 102 



* Sections and chapters marked with an asterisk may be skipped or covered 
out of order without interrupting the flow of ideas. 

+ Sections marked with a plus sign include results which rely on the theory of 
martingales. They may be skipped without interrupting the flow of ideas. 






2.2.1 Basic Properties 102 

2.2.2 Smoothness Properties 105 

2.2.3 A Characterization Theorem* 109 

2.3 Information 110 

2.3.1 Fisher Information 111

2.3.2 Kullback-Leibler Information 115 

2.3.3 Conditional Information* 118 

2.3.4 Jeffreys' Prior* 121 

2.4 Extremal Families* 123 

2.4.1 The Main Results 124 

2.4.2 Examples 127 

2.4.3 Proofs* 129 

2.5 Problems 138 

Chapter 3: Decision Theory 144 

3.1 Decision Problems 144 

3.1.1 Framework 144 

3.1.2 Elements of Bayesian Decision Theory 146 

3.1.3 Elements of Classical Decision Theory 149 

3.1.4 Summary 150 

3.2 Classical Decision Theory 150 

3.2.1 The Role of Sufficient Statistics 150 

3.2.2 Admissibility 153 

3.2.3 James-Stein Estimators 163 

3.2.4 Minimax Rules 167 

3.2.5 Complete Classes 174 

3.3 Axiomatic Derivation of Decision Theory* 181 

3.3.1 Definitions and Axioms 181 

3.3.2 Examples 186 

3.3.3 The Main Theorems 188 

3.3.4 Relation to Decision Theory 189 

3.3.5 Proofs of the Main Theorems* 190 

3.3.6 State-Dependent Utility* 205 

3.4 Problems 208 

Chapter 4: Hypothesis Testing 214 

4.1 Introduction 214 

4.1.1 A Special Kind of Decision Problem 214 

4.1.2 Pure Significance Tests 216 

4.2 Bayesian Solutions 218 

4.2.1 Testing in General 218 

4.2.2 Bayes Factors 220 

4.3 Most Powerful Tests 230 

4.3.1 Simple Hypotheses and Alternatives 233 

4.3.2 Simple Hypotheses, Composite Alternatives 238 

4.3.3 One-Sided Tests 239 

4.3.4 Two-Sided Hypotheses 246 

4.4 Unbiased Tests 253 

4.4.1 General Results 253 




4.4.2 Interval Hypotheses 255 

4.4.3 Point Hypotheses 257 

4.5 Nuisance Parameters 265 

4.5.1 Neyman Structure 265 

4.5.2 Tests about Natural Parameters 268 

4.5.3 Linear Combinations of Natural Parameters 272 

4.5.4 Other Two-Sided Cases* 272 

4.5.5 Likelihood Ratio Tests 274 

4.5.6 The Standard F-Test as a Bayes Rule* 276 

4.6 P-Values 279

4.6.1 Definitions and Examples 279 

4.6.2 P-Values and Bayes Factors 283

4.7 Problems 285 

Chapter 5: Estimation 296 

5.1 Point Estimation 296 

5.1.1 Minimum Variance Unbiased Estimation 297 

5.1.2 Lower Bounds on the Variance of Unbiased Estimators 301

5.1.3 Maximum Likelihood Estimation 307 

5.1.4 Bayesian Estimation 309 

5.1.5 Robust Estimation* 310 

5.2 Set Estimation 315 

5.2.1 Confidence Sets 315 

5.2.2 Prediction Sets* 324 

5.2.3 Tolerance Sets* 325 

5.2.4 Bayesian Set Estimation 327 

5.2.5 Decision Theoretic Set Estimation* 328 

5.3 The Bootstrap* 329 

5.3.1 The General Concept 329 

5.3.2 Standard Deviations and Bias 335 

5.3.3 Bootstrap Confidence Intervals 336 

5.4 Problems 339 

Chapter 6: Equivariance* 344 

6.1 Common Examples 344 

6.1.1 Location Problems 344 

6.1.2 Scale Problems* 350 

6.2 Equivariant Decision Theory 353 

6.2.1 Groups of Transformations 353 

6.2.2 Equivariance and Changes of Units 359 

6.2.3 Minimum Risk Equivariant Decisions 363 

6.3 Testing and Confidence Intervals* 375 

6.3.1 P-Values in Invariant Problems 375

6.3.2 Equivariant Confidence Sets 379 

6.3.3 Invariant Tests* 380 

6.4 Problems 388






Chapter 7: Large Sample Theory 394 

7.1 Convergence Concepts 394 

7.1.1 Deterministic Convergence 394 

7.1.2 Stochastic Convergence 395 

7.1.3 The Delta Method 401 

7.2 Sample Quantiles 404 

7.2.1 A Single Quantile 404 

7.2.2 Several Quantiles 408 

7.2.3 Linear Combinations of Quantiles* 410 

7.3 Large Sample Estimation 412 

7.3.1 Some Principles of Large Sample Estimation 412 

7.3.2 Maximum Likelihood Estimators 415 

7.3.3 MLEs in Exponential Families 418 

7.3.4 Examples of Inconsistent MLEs 420 

7.3.5 Asymptotic Normality of MLEs 421 

7.3.6 Asymptotic Properties of M-Estimators* 424 

7.4 Large Sample Properties of Posterior Distributions 428 

7.4.1 Consistency of Posterior Distributions+ 429

7.4.2 Asymptotic Normality of Posterior Distributions 435 

7.4.3 Laplace Approximations to Posterior Distributions* 446

7.4.4 Asymptotic Agreement of Predictive Distributions+ 455

7.5 Large Sample Tests 458 

7.5.1 Likelihood Ratio Tests 458 

7.5.2 Chi-Squared Goodness of Fit Tests 461 

7.6 Problems 467 

Chapter 8: Hierarchical Models 476 

8.1 Introduction 476 

8.1.1 General Hierarchical Models 476 

8.1.2 Partial Exchangeability* 479 

8.1.3 Examples of the Representation Theorem* 480 

8.2 Normal Linear Models 483 

8.2.1 One-Way ANOVA 483

8.2.2 Two-Way Mixed Model ANOVA* 488

8.2.3 Hypothesis Testing 491 

8.3 Nonnormal Models* 495 

8.3.1 Poisson Process Data 495 

8.3.2 Bernoulli Process Data 497 

8.4 Empirical Bayes Analysis* 500 

8.4.1 Naive Empirical Bayes 500 

8.4.2 Adjusted Empirical Bayes 503 

8.4.3 Unequal Variance Case 504 

8.5 Successive Substitution Sampling 505 

8.5.1 The General Algorithm 505 

8.5.2 Normal Hierarchical Models 512 

8.5.3 Nonnormal Models 517 

8.6 Mixtures of Models 519 

8.6.1 General Mixture Models 519 

8.6.2 Outliers 521 




8.6.3 Bayesian Robustness 524 

8.7 Problems 532 

Chapter 9: Sequential Analysis 536 

9.1 Sequential Decision Problems 536 

9.2 The Sequential Probability Ratio Test 548 

9.3 Interval Estimation* 558 

9.4 The Relevance of Stopping Rules 562 

9.5 Problems 567 

Appendix A: Measure and Integration Theory 570 

A.1 Overview 570

A.1.1 Definitions 570

A.1.2 Measurable Functions 572

A.1.3 Integration 573

A.1.4 Absolute Continuity 574

A.2 Measures 575

A.3 Measurable Functions 582

A.4 Integration 587

A.5 Product Spaces 593

A.6 Absolute Continuity 597

A.7 Problems 602

Appendix B: Probability Theory 606 

B.1 Overview 606

B.1.1 Mathematical Probability 606

B.1.2 Conditioning 607

B.1.3 Limit Theorems 611 

B.2 Mathematical Probability 612 

B.2.1 Random Quantities and Distributions 612 

B.2.2 Some Useful Inequalities 613

B.3 Conditioning 615 

B.3.1 Conditional Expectations 615 

B.3.2 Borel Spaces* 619 

B.3.3 Conditional Densities 623 

B.3.4 Conditional Independence 628 

B.3.5 The Law of Total Probability 632 

B.4 Limit Theorems 634 

B.4.1 Convergence in Distribution and in Probability 634 

B.4.2 Characteristic Functions 639 

B.5 Stochastic Processes 645 

B.5.1 Introduction 645 

B.5.2 Martingales* 645 

B.5.3 Markov Chains* 650 

B.5.4 General Stochastic Processes 651

B.6 Subjective Probability 654 

B.7 Simulation* 659 

B.8 Problems 661 






Appendix C: Mathematical Theorems Not Proven Here 665 

C.1 Real Analysis 665

C.2 Complex Analysis 666 

C.3 Functional Analysis 667

Appendix D: Summary of Distributions 668 

D.1 Univariate Continuous Distributions 668

D.2 Univariate Discrete Distributions 672 

D.3 Multivariate Distributions 674 

References 675 

Notation and Abbreviation Index 689 

Name Index 691 

Subject Index 694 



Chapter 1 
Probability Models 



1.1 Background 

The purpose of this book is to cover important topics in the theory of statis- 
tics in a very thorough and general fashion. In this section, we will briefly 
review some of the basic theory of statistics with which many students are 
familiar. All that we do here will be repeated in a more precise manner at 
the appropriate place in the text. 

1.1.1 General Concepts 

Most paradigms for statistical inference make at least some use of the
following structure. We suppose that some random variables $X_1, \ldots, X_n$
all have the same distribution, but we may be unwilling to say what that
distribution is. Instead, we create a collection of distributions called a
parametric family and denoted $\mathcal{P}_0$. For example, $\mathcal{P}_0$
might consist of all normal distributions, or just those normal distributions
with variance 1, or all binomial distributions, or all Poisson distributions,
and so forth. Each of these cases has the property that the collection of
distributions can be indexed by a finite-dimensional real quantity, which is
commonly called a parameter. For example, if the parametric family is all
normal distributions, then the parameter can be denoted
$\Theta = (M, \Sigma)$, where $M$ stands for the mean and $\Sigma$ stands for
the standard deviation. The set of all possible values of the parameter is
called the parameter space and is often denoted by $\Omega$. When
$\Theta = \theta$, the distribution of the observations is denoted by
$P_\theta$. Expected values are denoted as $E_\theta(\cdot)$.

We will denote the observed data $X$. It might be that $X$ is a vector of
observations that are mutually independent and identically distributed (IID),
or $X$ might be some general quantity. The set of possible values for $X$ is
the sample space and is often denoted as $\mathcal{X}$. The members
$P_\theta$ of the parametric family will be distributions over this space
$\mathcal{X}$. If $X$ is continuous or discrete, then densities or probability
mass functions$^1$ exist. We will denote the density or mass function for
$P_\theta$ by $f_{X|\Theta}(\cdot|\theta)$. For example, if $X$ is a single
random variable with continuous distribution, then

$$P_\theta(a < X \le b) = \int_a^b f_{X|\Theta}(x|\theta)\,dx.$$

If $X = (X_1, \ldots, X_n)$, where the $X_i$ are IID each with density (or mass
function) $f_{X_1|\Theta}(\cdot|\theta)$ when $\Theta = \theta$, then

$$f_{X|\Theta}(x|\theta) = \prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta), \tag{1.1}$$

where $x = (x_1, \ldots, x_n)$. After observing the data
$X_1 = x_1, \ldots, X_n = x_n$, the function in (1.1), as a function of
$\theta$ for fixed $x$, is called the likelihood function, denoted by
$L(\theta)$. Section 1.3 is devoted to a motivation of the above structure
based on the concept of exchangeability and DeFinetti's representation
theorem 1.49. Exchangeability is discussed in detail in Section 1.2, and
DeFinetti's theorem is the subject of Section 1.4.

1 Using the theory of measures (see Appendix A) we will be able to dispense
with the distinction between densities and probability mass functions. They will
both be special cases of a more general type of "density."

1.1.2 Classical Statistics

Classical inferential techniques include tests of hypotheses, unbiased
estimates, maximum likelihood estimates, confidence intervals, and many other
things. These will be covered in great detail in the text, but we remind the
reader of a few of them here. Suppose that we are interested in whether or
not the parameter lies in one portion $\Omega_H$ of the parameter space. We
could then set up a hypothesis $H : \Theta \in \Omega_H$ with the
corresponding alternative $A : \Theta \notin \Omega_H$. The simplest sort of
test of this hypothesis would be to choose a subset $R \subseteq \mathcal{X}$,
and then reject $H$ if $x \in R$ is observed. The set $R$ would be called the
rejection region for the test. If $x \notin R$, we would say that we do not
reject $H$. Tests are compared based on their power functions. The power
function of a test with rejection region $R$ is
$\beta(\theta) = P_\theta(X \in R)$. The size of a test is
$\sup_{\theta \in \Omega_H} \beta(\theta)$. Chapter 4 covers hypothesis
testing in depth.

Example 1.2. Suppose that $X = (X_1, \ldots, X_n)$ and the $X_i$ are IID with
$N(\theta, 1)$ distribution under $P_\theta$. The usual size $\alpha$ test of
$H : \Theta = \theta_0$ versus $A : \Theta \ne \theta_0$ is to reject $H$ if
$\bar{X} \in R$, where $\bar{X}$ is the sample average,

$$R = \left(-\infty,\ \theta_0 + \frac{1}{\sqrt{n}}\,\Phi^{-1}\!\left(\frac{\alpha}{2}\right)\right] \cup \left[\theta_0 + \frac{1}{\sqrt{n}}\,\Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right),\ \infty\right),$$

and $\Phi$ is the standard normal cumulative distribution function (CDF).
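As a rough illustration of Example 1.2, the following sketch computes the two cutoffs of a two-sided rejection region for the sample average and evaluates the power function $\beta(\theta) = P_\theta(\bar{X} \in R)$, using the fact that $\bar{X}$ has the $N(\theta, 1/n)$ distribution under $P_\theta$. The function names and default values are assumptions made for this sketch; they are not notation from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; .cdf is Phi, .inv_cdf is Phi^{-1}


def rejection_cutoffs(theta0, n, alpha=0.05):
    """Return (lower, upper); reject H when xbar <= lower or xbar >= upper."""
    # Phi^{-1}(alpha/2) is negative, so `lower` lies below theta0.
    lower = theta0 + STD_NORMAL.inv_cdf(alpha / 2) / n ** 0.5
    upper = theta0 + STD_NORMAL.inv_cdf(1 - alpha / 2) / n ** 0.5
    return lower, upper


def reject(xs, theta0, alpha=0.05):
    """Apply the test to a list of observations xs."""
    xbar = sum(xs) / len(xs)
    lower, upper = rejection_cutoffs(theta0, len(xs), alpha)
    return xbar <= lower or xbar >= upper


def power(theta, theta0, n, alpha=0.05):
    """beta(theta) = P_theta(Xbar in R), with Xbar ~ N(theta, 1/n)."""
    lower, upper = rejection_cutoffs(theta0, n, alpha)
    xbar_dist = NormalDist(mu=theta, sigma=1 / n ** 0.5)
    return xbar_dist.cdf(lower) + (1 - xbar_dist.cdf(upper))
```

Evaluating `power(theta0, theta0, n, alpha)` should return `alpha` up to rounding, which is one way to check that the region has the intended size.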

The notation and terminology in Chapter 4 are different from the above 
because we consider a more general class of tests called randomized tests. 
These are special cases of randomized decision rules, which are introduced 
in Chapter 3. The following example illustrates the reason that randomized 
decisions are introduced. 

Example 1.3. Let $X \sim \mathrm{Bin}(5, \theta)$ given $\Theta = \theta$.
Suppose that we wish to test $H : \Theta \le 1/2$ versus $A : \Theta > 1/2$.
It might seem that the best test would be to reject $H$ if $X > c$, where $c$
is chosen to make the test have the desired level. Unfortunately, only six
different levels are available for tests of this form. For example, if
$c \in [4, 5)$, the test has level 1/32. If $c \in [3, 4)$, the test has
level 3/16, and so on. If you desire a level such as 0.05, you must use a more
complicated test.
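The levels quoted in Example 1.3 can be checked with exact arithmetic. The sketch below assumes that the level of the test "reject $H$ if $X > c$" is attained at $\theta = 1/2$, which holds because $P_\theta(X > c)$ is nondecreasing in $\theta$ for a binomial distribution. The function name `level` is hypothetical.

```python
from fractions import Fraction
from math import comb


def level(c, n=5, theta=Fraction(1, 2)):
    """Exact P_theta(X > c) for X ~ Bin(n, theta)."""
    return sum(
        comb(n, k) * theta ** k * (1 - theta) ** (n - k)
        for k in range(n + 1)
        if k > c
    )


# One representative cutoff from each interval [c, c + 1), c = 0, ..., 4.
available_levels = [level(c) for c in range(5)]
```

Here `level(4)` is 1/32 and `level(3)` is 3/16, matching the text. Since 1/32 < 0.05 < 3/16, no cutoff of this form gives level exactly 0.05, which is the gap that randomized tests are meant to fill.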

A function of the data which takes values in the parameter space is called
a (point) estimator of $\Theta$. Section 5.1 considers point estimation in
depth.

Example 1.4. Suppose that $X = (X_1, \ldots, X_n)$ and the $X_i$ are IID with
$N(\theta, 1)$ distribution under $P_\theta$. Then
$\phi(x) = \sum_{i=1}^n x_i / n = \bar{x}$ takes values in the parameter
space and can be considered an estimator of $\Theta$.

Sometimes we wish to estimate a function $g$ of $\Theta$. An estimator $\phi$
of $g(\Theta)$ is unbiased if $E_\theta[\phi(X)] = g(\theta)$ for all
$\theta \in \Omega$. An estimator $\phi$ of $\Theta$ is a maximum likelihood
estimator (MLE) if

$$\sup_{\theta \in \Omega} L(\theta) = L(\phi(x)),$$

for all $x \in \mathcal{X}$. An estimator $\psi$ of $g(\Theta)$ is an MLE if
$\psi(X) = g(\phi(X))$, where $\phi$ is an MLE of $\Theta$. The reader should
verify that the estimator $\phi$ in Example 1.4 is both an unbiased estimator
and an MLE of $\Theta$.
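A numerical sanity check, not a proof, of the MLE claim for Example 1.4: the sketch below evaluates the log of the likelihood for IID $N(\theta, 1)$ data on a grid and compares the maximizer with the sample average. The data values, the grid limits, and the function names are hypothetical choices made for illustration.

```python
import math


def log_likelihood(theta, xs):
    """log L(theta) for IID N(theta, 1) observations xs."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - theta) ** 2 for x in xs)


def grid_argmax(xs, lo, hi, steps=20001):
    """Crude grid search for the maximizer of the log-likelihood on [lo, hi]."""
    best_theta, best_value = lo, log_likelihood(lo, xs)
    for i in range(1, steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        value = log_likelihood(theta, xs)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta


xs = [1.2, -0.4, 0.7, 2.1, 0.3]          # hypothetical observations
xbar = sum(xs) / len(xs)                 # the estimator phi(x) of Example 1.4
theta_hat = grid_argmax(xs, -5.0, 5.0)   # numerical maximizer of L
```

The grid maximizer agrees with `xbar` to within the grid spacing. The analytic argument is that $\sum_i (x_i - \theta)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2$, so the likelihood is largest at $\theta = \bar{x}$.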

If the parameter $\Theta$ is real-valued, it is common to provide interval
estimates of $\Theta$. If $(A, B)$ is a pair of random variables with
$A \le B$, and if

$$P_\theta(A \le \theta \le B) \ge \gamma,$$

for all $\theta \in \Omega$, then $[A, B]$ is called a coefficient $\gamma$
confidence interval for $\Theta$. Section 5.2 covers the theory of set
estimation, which includes confidence intervals, prediction intervals, and
tolerance intervals as special cases.

Example 1.5 (Continuation of Example 1.4). Suppose that
$X = (X_1, \ldots, X_n)$ and the $X_i$ are IID with $N(\theta, 1)$
distribution under $P_\theta$, and let

$$A = \bar{X} - \frac{c}{\sqrt{n}}, \qquad B = \bar{X} + \frac{c}{\sqrt{n}},$$

where $c > 0$. Then $[A, B]$ is a coefficient $1 - 2\Phi(-c)$ confidence
interval for $\Theta$, where $\Phi$ is the standard normal CDF.
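The coverage probability of the interval in Example 1.5 can be worked out and checked by simulation. Under $P_\theta$, $\sqrt{n}(\bar{X} - \theta)$ is standard normal, so the interval $[\bar{X} - c/\sqrt{n}, \bar{X} + c/\sqrt{n}]$ contains $\theta$ with probability $\Phi(c) - \Phi(-c)$. The sketch below compares that value with a Monte Carlo estimate; the seed, sample size, and function names are assumptions of this sketch.

```python
import random
from statistics import NormalDist


def exact_coverage(c):
    """Phi(c) - Phi(-c): probability that the interval covers theta."""
    std_normal = NormalDist()
    return std_normal.cdf(c) - std_normal.cdf(-c)


def simulated_coverage(theta, n, c, reps=20000, seed=0):
    """Fraction of simulated samples whose interval contains theta."""
    rng = random.Random(seed)
    half_width = c / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps
```

With `c = 1.96` the exact value is close to 0.95, and the simulated fraction should agree with it up to Monte Carlo error. Note that the coverage does not depend on `theta` or `n`.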






1.1.3 Bayesian Statistics 

In the Bayesian paradigm, one treats all unknown quantities as random
variables and constructs a joint probability distribution for all of them.
Using the same setup as in Section 1.1.1, this would require that one
construct a distribution for the parameter $\Theta$ in addition to the
conditional distribution of $X$ given $\Theta = \theta$, which was denoted by
$P_\theta$. The distribution of $\Theta$ is called the prior distribution.
Together, the prior distribution and $\{P_\theta : \theta \in \Omega\}$
determine a joint distribution on the space $\mathcal{X} \times \Omega$. For
example, suppose that the prior distribution has a density $f_\Theta$, suppose
that $X$ is continuous, and let $B \subseteq \mathcal{X} \times \Omega$. Then

$$\Pr((X, \Theta) \in B) = \int\!\!\int I_B(x, \theta)\, f_{X|\Theta}(x|\theta)\, f_\Theta(\theta)\, dx\, d\theta,$$

where $I_B$ is the indicator function of the set $B$. It will often be
possible (although not necessary) to think of the space
$\mathcal{X} \times \Omega$ as if it were the underlying probability space
$S$ which is introduced in Appendix B. In this way, $X$ and $\Theta$ are both
easily recognized as functions from $S$ to their respective ranges. That is,
if $s = (x, \theta)$, then $X(s) = x$ and $\Theta(s) = \theta$.

After observing the data $X = x$, one constructs the conditional distribution
of $\Theta$ given $X = x$, which is called the posterior distribution, using
Bayes' theorem:

$$f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta) f_\Theta(\theta)}{\int_\Omega f_{X|\Theta}(x|t) f_\Theta(t)\, dt}. \tag{1.6}$$

A popular method of finding the posterior distribution is to note that the
denominator of (1.6) is not a function of $\theta$. (In fact, the denominator in
(1.6) is called the prior predictive density of the data $X$, $f_X(x)$.) This
means that we can find $f_{\Theta|X}(\theta|x)$ by calculating the numerator of (1.6)
and then dividing it by whatever constant is required to make it a density
as a function of $\theta$.

Example 1.7 (Continuation of Example 1.4; see page 3). Suppose that $X =
(X_1, \ldots, X_n)$ and the $X_i$ are conditionally IID with $N(\theta, 1)$ distribution given
$\Theta = \theta$. Suppose that the prior distribution of $\Theta$ is $N(\theta_0, 1/\lambda)$, where $\theta_0$ and $\lambda$
are known constants. The likelihood function is

$$f_{X|\Theta}(x|\theta) = (2\pi)^{-n/2} \exp\left( -\frac{n}{2}[\theta - \bar{x}]^2 - \frac{1}{2}\sum_{i=1}^n [x_i - \bar{x}]^2 \right),$$

and the prior density is $f_\Theta(\theta) = \sqrt{\lambda}\,(2\pi)^{-1/2} \exp(-\lambda[\theta - \theta_0]^2/2)$. Multiplying
these together and simplifying yield the following expression for the numerator
of (1.6):

$$k(x) \exp\left( -\frac{\lambda + n}{2}[\theta - \theta_1]^2 \right), \tag{1.8}$$

where $\theta_1 = (\lambda\theta_0 + n\bar{x})/(\lambda + n)$, and $k(x)$ does not depend on $\theta$. The expression in
(1.8) is easily recognized as being proportional to the $N(\theta_1, 1/[\lambda + n])$ density as a
function of $\theta$. So, the posterior distribution of $\Theta$ given $X = x$ is $N(\theta_1, 1/[\lambda + n])$.
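The update in Example 1.7 is short enough to compute directly. The sketch below is illustrative only; the function name and argument names are ours, not the text's.

```python
def normal_posterior(xs, theta0, lam):
    """Posterior of Theta when X_i | Theta = theta are IID N(theta, 1)
    and the prior is N(theta0, 1/lam), as in Example 1.7.

    Returns (theta1, variance), where
    theta1 = (lam * theta0 + n * xbar) / (lam + n) and variance = 1 / (lam + n).
    """
    n = len(xs)
    xbar = sum(xs) / n
    theta1 = (lam * theta0 + n * xbar) / (lam + n)
    return theta1, 1.0 / (lam + n)
```

The posterior mean $\theta_1$ is a weighted average of the prior mean $\theta_0$ and the sample mean $\bar{x}$, with weights $\lambda$ and $n$, so the data dominate as $n$ grows.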



1.2. Exchangeability 5 



Inferences about $\Theta$, in the Bayesian paradigm, are based on the posterior
distribution. For example, one might use the posterior mean or median of
$\Theta$ as an estimate of $\Theta$. In Example 1.7 on page 4, the posterior mean and
median are both $\theta_1$. The Bayesian paradigm also accommodates inference
about future observables. If $Y$ denotes some future observations that are
conditionally independent of $X$ given $\Theta$, such as $Y = (X_{n+1}, \ldots, X_{n+m})$,
then the posterior predictive density of $Y$ is

$$f_{Y|X}(y|x) = \int_\Omega f_{Y|\Theta}(y|\theta) f_{\Theta|X}(\theta|x)\, d\theta.$$

Example 1.9 (Continuation of Example 1.7; see page 4). Let $Y = X_{n+1}$, the
next observation. The posterior predictive density of $Y$ is

$$f_{Y|X}(y|x) = \int \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}[y - \theta]^2\right) \frac{\sqrt{n + \lambda}}{\sqrt{2\pi}} \exp\left(-\frac{n + \lambda}{2}[\theta - \theta_1]^2\right) d\theta$$

$$= \frac{\sqrt{n + \lambda}}{\sqrt{2\pi(n + \lambda + 1)}} \exp\left(-\frac{n + \lambda}{2(n + \lambda + 1)}[y - \theta_1]^2\right),$$

which is the density of the $N(\theta_1, 1 + 1/[n + \lambda])$ distribution.

The theory of prior, posterior, and predictive distributions is introduced
in Section 1.3.1. Many Bayesian inferential techniques tend to be decision
theoretic, so the theory of decisions is introduced in Chapter 3. In the text,
Bayesian techniques are usually introduced at locations nearby those at
which corresponding classical techniques are introduced.



1.2 Exchangeability
1.2.1 Distributional Symmetry 

When one performs a statistical analysis, there are usually several quanti- 
ties about which one is uncertain. For example, when conducting a political 
poll, one never knows in advance which of several answers each respondent 
will provide. In addition, even after the responses are in, one does not know 
the answers that would have been supplied by all of the people who were 
not polled. If one is interested in the proportions of the population who 
would provide each of the available responses, then all of the would-be re- 
sponses of all members of the population are potentially of interest. The 
most complete specification of a probability distribution would give the 
joint distribution of all of these responses. From this joint distribution, the 
distributions of the various proportions of interest could also be calculated. 

The quantities of interest can be more complicated than counts and pro- 
portions without changing the basic considerations. For example, a com- 
pany may keep track of the total amount of a sample of its sale to a sample 
of its customers at a sample of its stores on a sample of days. It may be 






interested in various average sales amounts across different stores in a sin- 
gle department or across different departments in a single store, or across 
different days, and so on. Once again, the joint distribution of all vectors of 
total sale, register, store, and day would facilitate answering the questions 
of interest. 

How does (or should) one construct the probability distributions needed 
in such examples, and how does one draw inferences from the various types 
of data? Some of the more common ways to draw inferences were described 
briefly in Section 1.1. In order better to understand probability and in- 
ference, let us take a very simplistic example, which should not be too 
encumbered by considerations of available scientific knowledge. Consider 
an old-fashioned thumbtack 2 (one of the metal ones with a round, curved 
head, not the colored plastic ones). We will toss this thumbtack onto a soft 
surface 3 and keep track of whether it comes to stop with the point up or 
with the point down. In the absence of any information to distinguish the 
tosses or to suggest that tosses occurring close together in time are any 
more or less likely to be similar to or different from each other than those 
that are far apart in time, it seems reasonable to treat the different tosses 
symmetrically. We might also believe that although we might only toss the 
thumbtack a few times, if we were to toss it many more times, the same 
judgment of symmetry would continue to apply to the future tosses. Under 
these conditions, it is traditional to model the outcomes of the tosses as
independent and identically distributed (IID) random variables with $X_i = 1$
meaning that toss $i$ is point up and $X_i = 0$ meaning that toss $i$ is point
down. In the classical framework, one invents a parameter, say $\theta$, which is
assumed to be a fixed value not yet known to us.$^{4}$ Then one says that the
$X_i$ are IID with $\Pr(X_i = 1) = \theta$. Within a Bayesian framework, one might



2 This example is described in detail by Lindley and Phillips (1976). Other 
interesting examples of how exchangeability aids in the understanding of inference 
problems were given by Lindley and Novick (1981). This example is used, in 
preference to tossing of coins, because most readers will not have particularly 
strong prior opinions about how a thumbtack will land. On the other hand, 
most people believe that the typical coin selected from one's pocket or purse has 
probability pretty near 1/2 of landing head up. 

3 This is done to avoid damaging the thumbtack. This is the last scientific 
consideration we will make. 

4 A great deal of controversy in statistics arises out of the question of the mean- 
ing of such quantities. DeFinetti (1974) argues persuasively that one need not 
assume the existence of such things. Sometimes they are just assumed to be un- 
defined properties of the experimental setup which magically make the outcomes 
behave according to our probability models. Sometimes they are defined in terms
of the sequence of observations themselves (such as limits of relative frequencies). 
This last is particularly troublesome because the sequence of observations does 
not yet exist and hence the limit of relative frequency cannot be a fixed value 
yet. 






construct a probability distribution $\mu$ for this unknown $\Theta$ and say that

$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \int \theta^{x_1 + \cdots + x_n} (1 - \theta)^{n - x_1 - \cdots - x_n}\, d\mu(\theta). \tag{1.10}$$

It seems unfortunate that so much machinery as assumptions of mutual
independence and the existence of a mysterious fixed but unknown $\theta$ must
be introduced to describe what seems, on the surface, to be a relatively
simple situation. One purpose of this chapter is to show how to replace the
heavy probabilistic assumptions of IID and "fixed but unknown $\theta$" with
a minimal assumption that reflects nothing more than the symmetry ex- 
pressed in the problem. At the same time, we will be able to understand 
when models like that of (1.10) are appropriate and why relative frequency 
is such a popular device for thinking about probabilities. For example, when 
considering the tosses of the thumbtack, we said that we would treat the 
information to be obtained from any one toss in exactly the same way as we 
would treat the information from any other toss. Similarly, we would treat 
the information to be obtained from any two tosses in exactly the same way 
as we would treat the information from any other two tosses regardless of 
where they appear in the sequence of tosses, and so on for three or more 
tosses. This may seem like a heavy probabilistic assumption in itself. But it 
really is nothing more than an explicit expression of the symmetry amongst 
the tosses. Anything less would imply asymmetric treatment of the obser- 
vations. Note that assuming the tosses to be IID assumes this symmetry 
and more. The symmetry is quite explicit in formula (1.10). Every permutation
of the numbers $x_1, \ldots, x_n$ leads to the same value of the right-hand
side of (1.10). If we assume nothing more than this permutation symmetry
for a potentially infinite sequence of possible tosses of the thumbtack, then
Theorem 1.49$^{5}$ will imply that there exists $\mu$ such that (1.10) holds. In a
sense, the quantity $\theta$ is given an implicit meaning as a random variable
$\Theta$, rather than a fixed value, without having to explicitly give it meaning
in advance. (See Example 1.45 on page 25.) Furthermore, the observations
are not necessarily mutually independent, but they will be conditionally
independent given $\Theta$.
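The permutation symmetry of (1.10) can be seen concretely by picking a particular $\mu$. The sketch below assumes, purely for illustration, that $\mu$ is a Beta$(a, b)$ distribution with positive integer $a$ and $b$; the text leaves $\mu$ general. Under that assumption the integral in (1.10) can be built up one toss at a time, because the conditional probability of a 1 after $t$ tosses containing $s$ ones is $(a + s)/(a + b + t)$.

```python
from fractions import Fraction
from itertools import permutations


def sequence_prob(xs, a=2, b=3):
    """Pr(X_1 = x_1, ..., X_n = x_n) under (1.10) with a Beta(a, b) mixing
    distribution (an illustrative assumption), computed toss by toss."""
    prob = Fraction(1)
    ones_so_far = 0
    for t, x in enumerate(xs):
        p_one = Fraction(a + ones_so_far, a + b + t)  # Pr(next = 1 | past)
        prob *= p_one if x == 1 else 1 - p_one
        ones_so_far += x
    return prob


# Every reordering of the same outcomes has the same probability...
values = {sequence_prob(p) for p in permutations((1, 1, 0, 1, 0))}
# ...yet the tosses are not independent:
# Pr(X_1 = 1, X_2 = 1) differs from Pr(X_1 = 1) ** 2.
dependent = sequence_prob((1, 1)) != sequence_prob((1,)) ** 2
```

Here `values` contains a single number and `dependent` is true: the tosses are exchangeable without being mutually independent.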

The minimal assumption of symmetry is known as exchangeability, and it 
is no more complicated than the permutation symmetry noticed in (1.10). 
Definition 1.11. A finite set $X_1, \ldots, X_n$ of random quantities is said to be
exchangeable if every permutation of $(X_1, \ldots, X_n)$ has the same joint
distribution as every other permutation. An infinite collection is exchangeable
if every finite subcollection is exchangeable.

For example, suppose that $X_1, \ldots, X_{100}$ are exchangeable. It follows easily
from the definition that they all have the same marginal distribution.



5 Theorem 1.47 is a simpler version that applies only to Bernoulli random
variables.




Also, $(X_1, X_2)$ has the same joint distribution as $(X_{99}, X_1)$, $(X_5, X_2, X_{48})$
has the same joint distribution as $(X_{13}, X_{100}, X_3)$, and so on. The following
fact is easy to prove.

Proposition 1.12. A collection C of random quantities is exchangeable if 
and only if for every finite n less than or equal to the size of the collection 
C, every n-tuple of distinct elements of C has the same joint distribution 
as every other such n-tuple. 

As an example, we stated earlier that the assumption of IID random
variables entailed symmetry and more. 

Example 1.13. Consider a collection $X_1, X_2, \ldots$ (finite or infinite) of IID random
variables. Clearly, $(X_{i_1}, \ldots, X_{i_n})$ has the same distribution as $(X_{j_1}, \ldots, X_{j_n})$ so
long as $i_1, \ldots, i_n$ are all distinct and $j_1, \ldots, j_n$ are all distinct. Hence, every
collection of IID random variables is exchangeable.

The motivation for the definition of exchangeability is to express sym- 
metry of beliefs about the random quantities in the weakest possible way. 
The definition, as stated, does not require any judgment of independence 
or that any limit of relative frequencies will exist. It merely says that the 
labeling of the random quantities is immaterial. There are many situations 
in which this assumption is deemed reasonable, and many where it is not. 
For example, consider the company that sampled sales on various days at 
various stores. It might seem reasonable to declare that the sales at a par- 
ticular store on a particular day are exchangeable. But the collection of 
all sales on all days at all stores might be modeled less symmetrically. In 
Chapter 8, we will discuss in more detail cases with less symmetry. 

Back in the old days, before probability theory was overrun by $\sigma$-fields
and the like, the concept of symmetry was central to most calculations of 
probabilities. Consider, for example, the first paragraph of the book by 
DeMoivre (1756): 

The Probability of an Event is greater or less according to the 
number of Chances by which it may happen, compared with 
the whole number of Chances by which it may either happen 
or fail. 

DeMoivre was describing a judgment of symmetry amongst the possible 
outcomes of some experiment. But other authors, such as Venn (1876), rely 
on symmetry amongst a collection of random quantities to define probabili- 
ties as frequencies. 6 Although we now realize that symmetry is not essential 
to the definition of probability, it nevertheless is a widely used assumption 
that can help facilitate the construction of distributions. In addition, Theo- 
rem 1.49 helps to explain why frequencies are relevant to the calculation of 



6 The reader interested in an in-depth study of the early days of statistics and 
statistical reasoning should read Stigler (1986). 






probabilities even though probabilities are not defined as frequencies. (See 
the discussion in Section 1.2.2.) 

In Example 1.13 on page 8, we saw that IID random variables are ex- 
changeable. Exchangeability is more general than IID, however. A very 
common case of exchangeable random quantities is the following. Suppose 
that $X_1, X_2, \ldots$ are conditionally IID given $Y$. Then the $X_i$ are exchangeable.
(See Problem 4 on page 73.)

Example 1.14. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally independent with
density $f(x|y)$ given $Y = y$ and that $Y$ has density $g(y)$. Then the joint density of
any ordered $n$-tuple $(X_{i_1}, \ldots, X_{i_n})$ is

$$f_{X_{i_1}, \ldots, X_{i_n}}(x_1, \ldots, x_n) = \int \prod_{j=1}^n f(x_j|y)\, g(y)\, dy.$$

Note that the right-hand side does not depend on $i_1, \ldots, i_n$.

The case of conditionally IID random quantities will turn out to be one 
of only two general forms of exchangeability. Theorem 1.49 will say that 
infinitely many random quantities are exchangeable if and only if they are 
conditionally IID given something. 

Although an infinite sequence of exchangeable random variables is condi- 
tionally IID, sometimes the description of their joint distribution does not 
make this fact transparent. Example 1.15 is the famous Polya urn scheme. 
It is not obvious from the example that the random variables constructed 
are conditionally IID. Theorem 1.49, however, says that they are condi- 
tionally IID because they are exchangeable. 7 

Example 1.15. Let $\mathcal{X} = \{1, \ldots, k\}$, and let $u_1, \ldots, u_k$ be nonnegative integers
such that $u = \sum_{i=1}^k u_i > 0$. Suppose that an urn contains $u_i$ balls labeled $i$ for
$i = 1, \ldots, k$. We draw a ball at random$^{8}$ and record $X_1$ equal to the label. We
then replace the ball and toss in one more ball with the same label. We then
draw a ball at random again to get $X_2$ and repeat the process indefinitely. To
prove that the sequence $\{X_n\}_{n=1}^\infty$ is exchangeable, let $n > 0$ be an integer and let
$j_1, \ldots, j_n$ be elements of $\mathcal{X}$. For $i = 1, \ldots, k$, let $c_i(j_1, \ldots, j_n)$ be the number of
times that $i$ appears among $j_1, \ldots, j_n$. That is,$^{9}$ $c_i(j_1, \ldots, j_n) = \sum_{t=1}^n I_{\{i\}}(j_t)$.

Define the notation

$$(a)_b = a(a - 1) \cdots (a - b + 1),$$



7 Hill, Lane, and Sudderth (1987) prove that for $k = 2$, the Polya urn process
is the only exchangeable urn process aside from IID processes and deterministic
ones. (An urn process is deterministic if all balls drawn are the same. The common
label for all balls can still be random.)

8 What we mean by this is that every ball in the urn has the same probability 
of being drawn. 

9 We will often use the symbol $I_A(x)$ to stand for the indicator function of the
set $A$. That is, $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ if $x \notin A$.






where $(a)_0 = 1$ by convention. Then, we claim that

$$\Pr(X_1 = j_1, \ldots, X_n = j_n) = \frac{\prod_{i=1}^k (u_i + c_i(j_1, \ldots, j_n) - 1)_{c_i(j_1, \ldots, j_n)}}{(u + n - 1)_n}. \tag{1.16}$$

For $n = 1$, this reduces to $\Pr(X_1 = j_1) = u_{j_1}/u$, which is true. If we suppose that
(1.16) is true for $n = 1, \ldots, m$, then $\Pr(X_1 = j_1, \ldots, X_{m+1} = j_{m+1})$ equals

$$\Pr(X_1 = j_1, \ldots, X_m = j_m) \Pr(X_{m+1} = j_{m+1} | X_1 = j_1, \ldots, X_m = j_m)$$

$$= \Pr(X_1 = j_1, \ldots, X_m = j_m)\, \frac{u_{j_{m+1}} + c_{j_{m+1}}(j_1, \ldots, j_{m+1}) - 1}{u + m}. \tag{1.17}$$

In replacing $\Pr(X_1 = j_1, \ldots, X_m = j_m)$ by (1.16) in (1.17), we note that

$$c_i(j_1, \ldots, j_{m+1}) = c_i(j_1, \ldots, j_m) \quad \text{if } i \ne j_{m+1},$$
$$c_{j_{m+1}}(j_1, \ldots, j_m) = c_{j_{m+1}}(j_1, \ldots, j_{m+1}) - 1.$$

The result now follows immediately. 
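The claim (1.16) can also be checked by brute force on a small urn. The sketch below (our own illustration; the function names and the dictionary representation of the urn are assumptions) computes the probability of a label sequence in two ways: from the closed form, and by simulating the draw-and-add scheme exactly with rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations


def falling(a, b):
    """(a)_b = a (a - 1) ... (a - b + 1), with (a)_0 = 1."""
    out = 1
    for t in range(b):
        out *= a - t
    return out


def polya_formula(js, u):
    """Closed form (1.16); u maps each label i to its initial count u_i."""
    n = len(js)
    total = sum(u.values())
    numerator = 1
    for label, u_i in u.items():
        c_i = list(js).count(label)  # times the label appears among the draws
        numerator *= falling(u_i + c_i - 1, c_i)
    return Fraction(numerator, falling(total + n - 1, n))


def polya_sequential(js, u):
    """Multiply the one-step draw probabilities, adding a ball after each draw."""
    counts = dict(u)
    total = sum(u.values())
    prob = Fraction(1)
    for j in js:
        prob *= Fraction(counts[j], total)
        counts[j] += 1
        total += 1
    return prob
```

Because the closed form depends on the draws only through the counts $c_i$, every reordering of a sequence has the same probability, which is exactly exchangeability.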

The only other form of exchangeability, besides conditionally IID random 
quantities, is illustrated in a problem as simple as drawing balls without 
replacement from an urn. 

Example 1.18. Suppose that an urn has 20 balls, 14 of which are red and 6 of
which are blue. Suppose that we draw balls without replacement. Let $X_i$ be 1 if
the $i$th ball is red and 0 if it is blue. If we assume that all 20! possible ordered
draws of the balls are equally likely, then it is not difficult to see that the $X_i$
are exchangeable. To see that the draws are not conditionally IID, suppose that
there were a random quantity $Y$ such that the $X_i$ were conditionally IID given $Y$.
Since $0 < \Pr(X_1 = 0) = E(\Pr(X_1 = 0|Y))$ (by the law of total probability B.70),
it follows that $\Pr(X_1 = 0|Y) = 0$ a.s. is impossible. Hence $\Pr(\Pr(X_1 = 0|Y) >
0) > 0$, from which it follows that

$$\Pr(\Pr(X_1 = 0, X_2 = 0, \ldots, X_7 = 0|Y) > 0) = \Pr((\Pr(X_1 = 0|Y))^7 > 0)$$

$$= \Pr(\Pr(X_1 = 0|Y) > 0) > 0.$$

Hence, $\Pr(X_1 = 0, \ldots, X_7 = 0) = E(\Pr(X_1 = 0, \ldots, X_7 = 0|Y)) > 0$. But this
is absurd, since there are only 6 blue balls. It must be the case that the $X_i$,
although exchangeable, are not conditionally IID.
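Both halves of Example 1.18 are easy to confirm numerically. The following sketch (ours, with hypothetical names) computes the exact probability of an ordered sequence of draws without replacement from the urn with 14 red and 6 blue balls.

```python
from fractions import Fraction
from itertools import permutations


def draw_prob(xs, red=14, blue=6):
    """Probability of the ordered outcomes xs (1 = red, 0 = blue) when
    drawing without replacement; intended for len(xs) <= red + blue."""
    prob = Fraction(1)
    for x in xs:
        total = red + blue
        if x == 1:
            prob *= Fraction(red, total)
            red -= 1
        else:
            prob *= Fraction(blue, total)
            blue -= 1
    return prob


# Exchangeable: reordering the outcomes does not change the probability.
same = {draw_prob(p) for p in permutations((1, 0, 0, 1))}
# Seven blue balls in a row cannot happen, although six can.
seven_blue = draw_prob((0,) * 7)
six_blue = draw_prob((0,) * 6)
```

Here `same` has a single element, `six_blue` is positive, and `seven_blue` is exactly 0, which is the event that a conditionally IID representation would force to have positive probability.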

Theorem 1.48 will say that a finite collection of random quantities is 
exchangeable if and only if they are like draws from an urn without re- 
placement. 



1.2.2 Frequency and Exchangeability 

There was a time when people thought that probabilities had to be frequen- 
cies, and as such, we could not know what they were before collecting an 
infinite amount of data. [See Von Mises (1957) for an example.] Although 
it is still true that we cannot know frequencies (such as the limit of the 






proportion of successes in a sequence of exchangeable Bernoulli random
variables) without collecting an infinite amount of data, DeFinetti's
representation theorem for Bernoulli random variables 1.47 tells us that such
a limit of frequencies $\Theta$ is only a conditional probability given
information that we do not yet have. The probabilities themselves are calculated
based on subjective judgments. The possibly surprising fact is that even
though different people might calculate different probabilities for the same
sequence of Bernoulli random variables, if they all believe the sequence
to be exchangeable, then they all believe that there exists $\Theta$ such that
conditional on $\Theta = \theta$, the random variables are IID $Ber(\theta)$. That is, the
subjective judgment of exchangeability for a sequence of random variables 
entails certain consequences that are common to every specific instance of 
the judgment, even when the specific instances differ in other ways. 

Example 1.19. Let $\{X_n\}_{n=1}^\infty$ be Bernoulli random variables. Suppose that two
different people give them the following joint distributions. Let $i_1, i_2, \ldots$ stand
for numbers in $\{0, 1\}$. One person believes

$$\Pr(X_1 = i_1, \ldots, X_n = i_n) = \frac{12}{(x + 2)\binom{n+4}{x+2}},$$

where $x$ stands for $\sum_{j=1}^n i_j$, and the other believes this probability to be
$\left([n + 1]\binom{n}{x}\right)^{-1}$. The first person believes that $\Pr(X_1 = 1) = 0.4$, while the second
believes $\Pr(X_1 = 1) = 0.5$. On the other hand, both of these distributions are
exchangeable, and so Theorem 1.47 says that both persons believe that $\Theta =
\lim_{N \to \infty} \sum_{i=1}^N X_i/N$ exists with probability 1, and that $\Pr(X_1 = 1|\Theta = \theta) =
\theta$. They must disagree on the distribution of $\Theta$. For example, the law of total
probability B.70 says $\Pr(X_1 = 1) = E(\Theta)$, hence they must have different values
of $E(\Theta)$.

If probabilities are not frequencies, then why are frequencies thought 
to be so important in calculating probabilities? The answer lies in careful 
examination of the implications of DeFinetti's representation theorem. 

Example 1.20 (Continuation of Example 1.19). Suppose that the two people in
this example both observe $X = (X_1, \ldots, X_{20}) = y$, and suppose that $y$ consists of
14 1s and 6 0s. It is not difficult to calculate the conditional distribution of $X_{21}$
given this data. For example, to get $\Pr(X_{21} = 1|X = y)$, we just divide the joint
probability of $X = y$ and $X_{21} = 1$ by the probability of $X = y$. The first person
believes

$$\Pr(X_{21} = 1|X = y) = \frac{16}{25} = 0.64,$$

while the second person believes $\Pr(X_{21} = 1|X = y) = 15/22 = 0.68$. Notice how
much closer these probabilities are to each other than were the prior probabilities
of 0.4 and 0.5. Also, notice how close each of them is to the proportion of successes,
0.7.
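The numbers in Examples 1.19 and 1.20 can be reproduced under one assumption that we state plainly: we take the first person's distribution for $\Theta$ to be Beta(2, 3) and the second person's to be uniform, that is, Beta(1, 1). These choices are ours; they are consistent with the prior probabilities 0.4 and 0.5 and with the second person's formula $([n + 1]\binom{n}{x})^{-1}$, but they are a sketch, not a statement of the author's construction.

```python
from fractions import Fraction
from math import comb


def rising(a, m):
    """a (a + 1) ... (a + m - 1), with the empty product equal to 1."""
    out = 1
    for t in range(m):
        out *= a + t
    return out


def joint_prob(x, n, a, b):
    """Probability of one specific 0/1 sequence of length n with x ones
    when Theta has a Beta(a, b) distribution with integer a and b."""
    return Fraction(rising(a, x) * rising(b, n - x), rising(a + b, n))


def predictive(x, n, a, b):
    """Pr(X_{n+1} = 1 | data): a ratio of two joint probabilities."""
    return joint_prob(x + 1, n + 1, a, b) / joint_prob(x, n, a, b)


first = predictive(14, 20, 2, 3)   # first person, assumed Beta(2, 3)
second = predictive(14, 20, 1, 1)  # second person, uniform
```

Under these assumptions `first` equals 16/25 = 0.64 and `second` equals 15/22, about 0.68; both have moved from the prior values 0.4 and 0.5 toward the observed proportion 14/20 = 0.7.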

In Example 1.57 on page 31, we will see the general method for finding 
the conditional distribution of $\Theta$ after observing some Bernoulli trials. But






Example 1.20 gives us some hint of what happens. In Example 1.20, after 
observing 20 Bernoulli trials, the mean of $\Theta$ changed to a number closer
to the proportion of successes, regardless of what the prior mean of $\Theta$
was. The conditional mean of $\Theta$ given the observed data is the probability
of a success on a future trial given the data. If we believe a sequence of 
Bernoulli random variables to be exchangeable, and we are not already 
certain about the limit of the proportion of successes, then after we observe 
some data, we will modify our opinion about future observations so that the 
probability of success is now closer to the observed proportion of successes. 
This phenomenon has nothing to do with frequencies being probabilities. 
It is merely a consequence of exchangeability. 

1.3 Parametric Models 

DeFinetti's representation theorem 1.49 says that infinitely many random
quantities $\{X_n\}_{n=1}^\infty$ are exchangeable if and only if they are conditionally
IID given the limit of their empirical probability measures. The empirical
probability measure (or empirical distribution) of $X_1, \ldots, X_n$ is the random
probability measure

$$P_n(B) = \frac{1}{n} \sum_{i=1}^n I_B(X_i), \quad \text{for every } B \in \mathcal{B}. \tag{1.21}$$

For the case of random variables, the empirical distribution is equivalent
to the empirical distribution function, $F_n(t) = \sum_{i=1}^n I_{(-\infty, t]}(X_i)/n$, the
function which is 0 at $-\infty$ and has jumps of size $1/n$ at each observation.

If we are considering a sequence of exchangeable random quantities, let
$\Theta$ be some one-to-one function of the limit of the empirical distributions,
and let $\Omega$ be the set of possible values for $\Theta$. Let $P_\theta$ denote the conditional
distribution of $X_n$ given $\Theta = \theta$. Then $\mathcal{P}_0 = \{P_\theta : \theta \in \Omega\}$ looks like a
typical parametric family with which we are already familiar. Also, $\Theta$ is a
measurable function of the entire sequence $\{X_n\}_{n=1}^\infty$, hence its distribution
is induced (see Theorem A.81) from the distribution of the sequence. For
this reason, it is natural to think of $\Theta$ as a random quantity in this situation.

Although DeFinetti's representation theorem 1.49 is central to motivating
parametric models, it is not actually used in their implementation.
Furthermore, the concept of parametric models extends to more general
situations, albeit without the same justification. For this reason, we will
postpone formal treatment of DeFinetti's theorem until Section 1.4. In
Section 1.3.1, we introduce the framework for the use of parametric families in
general situations. Most familiar examples will be of exchangeable random
variables, but other examples will be given as well. In all cases, however,
we will treat the parameter as a random quantity, just as we would if the
data were exchangeable.
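The empirical probability measure (1.21) and the empirical distribution function are simple to compute. In this small sketch (names are ours) a set $B$ is represented by its indicator function, which is an implementation convenience rather than anything the text prescribes.

```python
def empirical_measure(xs, indicator):
    """P_n(B) = (1/n) * sum_i I_B(X_i), with B given by its indicator."""
    return sum(1 for x in xs if indicator(x)) / len(xs)


def edf(xs, t):
    """Empirical distribution function F_n(t) = P_n((-inf, t])."""
    return empirical_measure(xs, lambda x: x <= t)
```

With observations `[3, 1, 2, 2]`, `edf(xs, 2)` is 0.75: three of the four observations are at most 2, and the function jumps by 1/n at each observation (by 2/n at a tied value).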



1.3. Parametric Models 13 



1.3.1 Prior, Posterior, and Predictive Distributions 

We begin by making explicit the general concept of parameter and para- 
metric family. 

Definition 1.22. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let $(\mathcal{X}, \mathcal{B})$ and
$(\Omega, \tau)$ be Borel spaces. Let $X : S \to \mathcal{X}$ and $\Theta : S \to \Omega$ be measurable. Then
$\Theta$ is called a parameter and $\Omega$ is called a parameter space. The conditional
distribution for $X$ given $\Theta$ is called a parametric family of distributions of
$X$. The parametric family is denoted by

$$\mathcal{P}_0 = \{P_\theta : \forall A \in \mathcal{B},\ P_\theta(A) = \Pr(X \in A|\Theta = \theta),\ \text{for } \theta \in \Omega\}.$$

We also use the symbol $P'_\theta(X \in A)$ to stand for $P_\theta(A)$.$^{10}$ The prior
distribution of $\Theta$ is the probability measure $\mu_\Theta$ over $(\Omega, \tau)$ induced by $\Theta$ from
$\mu$.

Suppose that each $P_\theta$, when considered as a measure on $(\mathcal{X}, \mathcal{B})$, is
absolutely continuous with respect to a measure $\nu$ on $(\mathcal{X}, \mathcal{B})$. Let

$$f_{X|\Theta}(x|\theta) = \frac{dP_\theta}{d\nu}(x).$$

(It will be common in this text to denote the conditional density function
of one random quantity $X$ given another $Y$ by $f_{X|Y}$.) We can assume that
$f_{X|\Theta}$ is measurable with respect to the product $\sigma$-field $\mathcal{B} \otimes \tau$.$^{11}$ This will
allow us to integrate this function with respect to measures on both $\mathcal{X}$ and
$\Omega$. The function $f_{X|\Theta}(x|\theta)$, considered as a function of $\theta$ after $X = x$ is
observed, is often called the likelihood function $L(\theta)$.

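For each $\theta \in \Omega$, the function $f_{X|\Theta}(\cdot|\theta)$ is the conditional density with
respect to $\nu$ of $X$ given $\Theta = \theta$. That is, for each $A \in \mathcal{B}$,

$$\Pr(X \in A|\Theta = \theta) = \int_A f_{X|\Theta}(x|\theta)\, d\nu(x).$$

We let $\mu_X$ denote the marginal distribution of $X$ ($\mu_X(A) = \Pr(X \in A)$).
Using Tonelli's theorem A.69, we can write

$$\mu_X(A) = \int_\Omega \int_A f_{X|\Theta}(x|\theta)\, d\nu(x)\, d\mu_\Theta(\theta) = \int_A \int_\Omega f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)\, d\nu(x).$$

It follows that $\mu_X$ is absolutely continuous with respect to $\nu$ with density

$$f_X(x) = \int_\Omega f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta). \tag{1.23}$$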

10 In this manner, $P'_\theta$ is a probability measure on the space $(S, \mathcal{A})$ and $P_\theta$ is
a probability measure on the space $(\mathcal{X}, \mathcal{B})$. This fine mathematical point could
usually be ignored without causing much confusion, but we will try to be as
precise as possible for the sake of those few cases where it matters.

11 See Problem 9 on page 74 for a way to prove this.






This density is often called the (prior) predictive density of X or the 
marginal density of X. 

For example, suppose that $X = (X_1, \ldots, X_n)$, where the $X_i$ are
exchangeable and conditionally independent given $\Theta$, each with conditional
density $f_{X_1|\Theta}(\cdot|\theta)$ with respect to a measure $\nu^1$. Then the conditional joint
density of $X_1, \ldots, X_n$ given $\Theta = \theta$ (the likelihood in this case) with respect
to the $n$-fold product measure $\nu^n$ can be written as

$$f_{X_1, \ldots, X_n|\Theta}(x_1, \ldots, x_n|\theta) = \prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta).$$

The unconditional joint (prior predictive) density of $X_1, \ldots, X_n$ is

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \int_\Omega \prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta)\, d\mu_\Theta(\theta).$$



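Example 1.24. Let $X = (X_1, \ldots, X_n)$, where the $X_i$ are conditionally IID
with $N(\mu, \sigma^2)$ distribution given $(M, \Sigma) = (\mu, \sigma)$. (Here the parameter is $\Theta =
(M, \Sigma)$.) Let the prior distribution be that $\Sigma^2$ has inverse gamma distribution
$\Gamma^{-1}(a_0/2, b_0/2)$ and $M$ given $\Sigma = \sigma$ has $N(\mu_0, \sigma^2/\lambda_0)$ distribution with $a_0$, $b_0$,
$\mu_0$, and $\lambda_0$ constants. The likelihood function in this case can be written as

$$f_{X|\Theta}(x|\mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2}\left[ n(\bar{x} - \mu)^2 + w \right] \right),$$

where $\bar{x} = \sum_{i=1}^n x_i/n$ and $w = \sum_{i=1}^n (x_i - \bar{x})^2$. The prior density with respect to
Lebesgue measure is

$$f_\Theta(\mu, \sigma) = \frac{2 \left(\frac{b_0}{2}\right)^{a_0/2} \sqrt{\lambda_0}}{\sqrt{2\pi}\, \Gamma\left(\frac{a_0}{2}\right)}\, \sigma^{-(a_0 + 2)} \exp\left( -\frac{1}{2\sigma^2}\left[ \lambda_0(\mu - \mu_0)^2 + b_0 \right] \right), \quad \text{for } \sigma > 0. \tag{1.25}$$

The prior predictive distribution of the observations can be calculated by
multiplying together the two functions above and integrating out the parameter. After
completing the square in the exponent, the product can be written as

$$\frac{2 \left(\frac{b_0}{2}\right)^{a_0/2} \sqrt{\lambda_0}}{(2\pi)^{(n+1)/2}\, \Gamma\left(\frac{a_0}{2}\right)}\, \sigma^{-(a_1 + 2)} \exp\left( -\frac{1}{2\sigma^2}\left[ \lambda_1(\mu - \mu_1)^2 + b_1 \right] \right), \tag{1.26}$$

where

$$a_1 = a_0 + n, \qquad \lambda_1 = \lambda_0 + n,$$

$$b_1 = b_0 + w + \frac{n\lambda_0(\bar{x} - \mu_0)^2}{\lambda_0 + n}, \qquad \mu_1 = \frac{\lambda_0\mu_0 + n\bar{x}}{\lambda_1}.$$

Note that, as a function of $(\mu, \sigma)$, this is in the same form as the prior density
(1.25) with the four numbers $a_0, b_0, \mu_0, \lambda_0$ replaced by $a_1, b_1, \mu_1, \lambda_1$. The
integral over $(\mu, \sigma)$ is just the constant factor that appears in (1.26) divided by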






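the result of changing $a_0, b_0, \mu_0, \lambda_0$ to $a_1, b_1, \mu_1, \lambda_1$, respectively, in the constant
factor in (1.25). That is,

$$f_X(x) = \frac{\Gamma\left(\frac{a_1}{2}\right) \left(\frac{b_0}{2}\right)^{a_0/2} \sqrt{\lambda_0}}{(2\pi)^{n/2}\, \Gamma\left(\frac{a_0}{2}\right) \left(\frac{b_1}{2}\right)^{a_1/2} \sqrt{\lambda_1}}. \tag{1.27}$$

A specialized calculation of the preceding sort is often of interest in this
example. Let $\bar{Y}_n$ be the average of the $n$ observations. The conditional distribution of
$\bar{Y}_n$ given $\Theta = (\mu, \sigma)$ is $N(\mu, \sigma^2/n)$. The prior predictive density of $\bar{Y}_n$
can be calculated by integrating the $N(\mu, \sigma^2/n)$ density times (1.25) with respect
to $\mu$ and $\sigma$. Alternatively, one can argue as follows. Using well-known features of
the normal distribution, we can conclude that the conditional distribution of $\bar{Y}_n$
given $\Sigma = \sigma$ is $N(\mu_0, \sigma^2[1/n + 1/\lambda_0])$. If we multiply the corresponding normal
density times the marginal density of $\Sigma$ and integrate over $\sigma$, we get

$$f_{\bar{Y}_n}(y) = \int_0^\infty \frac{2 \left(\frac{b_0}{2}\right)^{a_0/2}}{\sqrt{2\pi\left(\frac{1}{n} + \frac{1}{\lambda_0}\right)}\, \Gamma\left(\frac{a_0}{2}\right)}\, \sigma^{-(a_0 + 2)} \exp\left( -\frac{1}{2\sigma^2}\left[ b_0 + \frac{(y - \mu_0)^2}{\frac{1}{n} + \frac{1}{\lambda_0}} \right] \right) d\sigma$$

$$= \frac{\Gamma\left(\frac{a_0 + 1}{2}\right)}{\Gamma\left(\frac{a_0}{2}\right) \sqrt{\pi \left(\frac{1}{n} + \frac{1}{\lambda_0}\right) b_0}} \left( 1 + \frac{(y - \mu_0)^2}{\left(\frac{1}{n} + \frac{1}{\lambda_0}\right) b_0} \right)^{-(a_0 + 1)/2},$$

which is the density of the $t_{a_0}\left(\mu_0, \sqrt{(1/n + 1/\lambda_0)\, b_0/a_0}\right)$ distribution.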
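The hyperparameter update of Example 1.24 amounts to four lines of arithmetic. The sketch below is illustrative (the function name is ours); it returns $(a_1, b_1, \mu_1, \lambda_1)$ from the data and $(a_0, b_0, \mu_0, \lambda_0)$, using the formulas displayed after (1.26).

```python
def update_hyperparameters(xs, a0, b0, mu0, lam0):
    """Map (a0, b0, mu0, lam0) to (a1, b1, mu1, lam1) for conditionally IID
    N(mu, sigma^2) data, with the prior family of Example 1.24."""
    n = len(xs)
    xbar = sum(xs) / n
    w = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations
    a1 = a0 + n
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    b1 = b0 + w + n * lam0 * (xbar - mu0) ** 2 / lam1
    return a1, b1, mu1, lam1
```

The step that justifies these formulas is completing the square: for every $\mu$, $n(\bar{x} - \mu)^2 + w + \lambda_0(\mu - \mu_0)^2 + b_0 = \lambda_1(\mu - \mu_1)^2 + b_1$, which can be spot-checked at any value of $\mu$.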

As we mentioned earlier, the use of parametric families does not require 
that the data be a collection of exchangeable random quantities. Here are 
some examples of nonexchangeable random quantities whose distributions 
could usefully be modeled using finite-dimensional parametric families. 

Example 1.28. Let $\{X_n\}_{n=1}^\infty$ be a sequence of Bernoulli random variables that
are not exchangeable. Instead, let $\mathcal{P}_0$ be the set of joint distributions for infinitely
many Bernoulli random variables that form a Markov chain. For $P \in \mathcal{P}_0$, define

$$\Theta(P) = (\Pr(X_1 = 1)(P),\ \Pr(X_{i+1} = 1|X_i = 1)(P),\ \Pr(X_{i+1} = 1|X_i = 0)(P))$$
$$= (p_1, p_{11}, p_{01}).$$

Let $\lambda$ be any probability over $[0, 1]^3$, and set

$$\Pr(X_1 = i_1, \ldots, X_n = i_n) = \int p_1^{i_1}(1 - p_1)^{1 - i_1}\, p_{11}^{k_{1,1}} (1 - p_{11})^{k_{0,1}}\, p_{01}^{k_{1,0}} (1 - p_{01})^{k_{0,0}}\, d\lambda(p_1, p_{11}, p_{01}), \tag{1.29}$$

where $k_{s,t}$ is the number of times that $s$ follows $t$ in the sequence $i_1, \ldots, i_n$.
Diaconis and Freedman (1980c) prove that, aside from pathological cases, if all
finite sequences of 0s and 1s that have the same first element and the same values
of $k_{s,t}$ have the same probability, then (1.29) must hold.
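To make the role of the transition counts in Example 1.28 concrete, the sketch below (ours; names are hypothetical) computes the counts $k_{s,t}$ and the probability of a sequence for one fixed value $(p_1, p_{11}, p_{01})$ of the parameter, reading $p_{01}$ as the probability of a 1 following a 0. This is the conditional probability given $\Theta$; mixing it over a distribution $\lambda$ is not attempted here.

```python
def transition_counts(seq):
    """k[(s, t)] = number of times the value s follows the value t in seq."""
    k = {(s, t): 0 for s in (0, 1) for t in (0, 1)}
    for prev, nxt in zip(seq, seq[1:]):
        k[(nxt, prev)] += 1
    return k


def chain_prob(seq, p1, p11, p01):
    """Probability of a 0/1 sequence for a two-state Markov chain with
    Pr(X_1 = 1) = p1, Pr(1 after 1) = p11, and Pr(1 after 0) = p01."""
    k = transition_counts(seq)
    first = p1 if seq[0] == 1 else 1 - p1
    return (first
            * p11 ** k[(1, 1)] * (1 - p11) ** k[(0, 1)]
            * p01 ** k[(1, 0)] * (1 - p01) ** k[(0, 0)])
```

The sequences (1, 1, 0, 1) and (1, 0, 1, 1) share a first element and all four transition counts, so they receive the same probability for every parameter value, and hence under any mixture of such chains.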

Example 1.30. This example is the simple linear regression problem. Suppose
that $x_1, x_2, \ldots$ are fixed known numbers and $E_1, E_2, \ldots$ are exchangeable random
variables that are conditionally independent given $\Sigma = \sigma$ with density $f_{E|\Sigma}(e|\sigma)$.
(Think of the $E_i$ as the error or noise term in a regression model.) Define $Y_i =
E_i + Bx_i$, where $B$ is a random variable such that $B$ and $\Sigma$ have joint distribution
$\mu_{B,\Sigma}$. The parameter now consists of $\Theta = (B, \Sigma)$. The random variables $Y_1, Y_2, \ldots$
are not exchangeable even though $E_1, E_2, \ldots$ are exchangeable. The reader should
see Zellner (1971) for an in-depth discussion of Bayesian analysis of regression
models.

The conditional distribution of $\Theta$ given $X = x$ is called the posterior
distribution of $\Theta$. The next theorem shows us how to calculate the posterior
distribution of a parameter in the case in which there is a measure $\nu$ such
that each $P_\theta \ll \nu$.

Theorem 1.31 (Bayes' theorem).$^{12}$ Suppose that $X$ has a parametric
family $\mathcal{P}_0$ of distributions with parameter space $\Omega$. Suppose that $P_\theta \ll \nu$
for all $\theta \in \Omega$, and let $f_{X|\Theta}(x|\theta)$ be the conditional density (with respect to
$\nu$) of $X$ given $\Theta = \theta$. Let $\mu_\Theta$ be the prior distribution of $\Theta$. Let $\mu_{\Theta|X}(\cdot|x)$
denote the conditional distribution of $\Theta$ given $X = x$. Then $\mu_{\Theta|X} \ll \mu_\Theta$,
a.s. with respect to the marginal of $X$, and the Radon-Nikodym derivative
is

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)}{\int_\Omega f_{X|\Theta}(x|t)\, d\mu_\Theta(t)}$$

for those $x$ such that the denominator is neither 0 nor infinite. The prior
predictive probability of the set of $x$ values such that the denominator is
0 or infinite is 0, hence the posterior can be defined arbitrarily for such $x$
values.

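Proof. First, we prove the claims about the denominator. Let

$$C_0 = \left\{ x : \int_\Omega f_{X|\Theta}(x|t)\, d\mu_\Theta(t) = 0 \right\},$$

$$C_\infty = \left\{ x : \int_\Omega f_{X|\Theta}(x|t)\, d\mu_\Theta(t) = \infty \right\}.$$

Let $\mu_X$ be the marginal distribution of $X$,

$$\mu_X(A) = \int_A \int_\Omega f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)\, d\nu(x).$$

It follows that

$$\mu_X(C_0) = \int_{C_0} \int_\Omega f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)\, d\nu(x) = 0,$$

$$\mu_X(C_\infty) = \int_{C_\infty} \int_\Omega f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)\, d\nu(x) = \int_{C_\infty} \infty\, d\nu(x).$$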



12 Theorem 1.31 applies equally well to infinite-dimensional parameters as to
finite-dimensional parameters. In infinite-dimensional cases, however, the condition
$P_\theta \ll \nu$ for all $\theta$ often fails. In fact, the proof applies even if $(\Omega, \tau)$ is not
a Borel space. In this last case, a regular conditional distribution is explicitly
constructed without knowing in advance that one will exist.






This last integral will equal $\infty$ if $\nu(C_\infty) > 0$. Since this is impossible, it
must be that $\nu(C_\infty) = 0$, hence $\mu_X(C_\infty) = 0$.

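The posterior distribution $\mu_{\Theta|X}$ must satisfy the following. For all sets
$A \in \mathcal{B}$ and all $B \in \tau$,

$$\Pr(\Theta \in B, X \in A) = \int_A \mu_{\Theta|X}(B|x)\,d\mu_X(x). \qquad (1.32)$$

Using Tonelli's theorem A.69 we can write

$$\Pr(\Theta \in B, X \in A) = \int_B \int_A f_{X|\Theta}(x|\theta)\,d\nu(x)\,d\mu_\Theta(\theta)
= \int_A \int_B f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)\,d\nu(x). \qquad (1.33)$$

Next, since $\mu_X$ has density $\int_\Omega f_{X|\Theta}(x|t)\,d\mu_\Theta(t)$ with respect to $\nu$, write

$$\int_A \mu_{\Theta|X}(B|x)\,d\mu_X(x) = \int_A \left[\mu_{\Theta|X}(B|x) \int_\Omega f_{X|\Theta}(x|t)\,d\mu_\Theta(t)\right] d\nu(x).$$

Combining this with (1.33) shows that (1.32) is satisfied for all $A$ and $B$ if
and only if

$$\mu_{\Theta|X}(B|x) = \frac{\int_B f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)}{\int_\Omega f_{X|\Theta}(x|t)\,d\mu_\Theta(t)},$$

a.s. $[\mu_X]$. It follows that $\mu_{\Theta|X}(\cdot|x) \ll \mu_\Theta$ and that $d\mu_{\Theta|X}/d\mu_\Theta(\cdot|x)$ is as specified. □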

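Example 1.34. Suppose that $X$ has $Bin(n, \theta)$ distribution given $\Theta = \theta$ and that the prior
distribution of $\Theta$ is $Beta(a_0, b_0)$. The marginal density of $X$ with respect to
counting measure on the integers is

$$f_X(x) = \binom{n}{x} \frac{\Gamma(a_0 + b_0)\,\Gamma(a_0 + x)\,\Gamma(b_0 + n - x)}{\Gamma(a_0)\,\Gamma(b_0)\,\Gamma(a_0 + b_0 + n)},$$

for $x = 0, \ldots, n$.

The posterior density of $\Theta$ with respect to the prior distribution of $\Theta$ is the ratio
of $\binom{n}{x}\theta^x(1 - \theta)^{n-x}$ to $f_X(x)$. The posterior density of $\Theta$ with respect to Lebesgue
measure is

$$\frac{\Gamma(a_0 + b_0 + n)}{\Gamma(a_0 + x)\,\Gamma(b_0 + n - x)}\,\theta^{a_0 + x - 1}(1 - \theta)^{b_0 + n - x - 1},$$

which is easily seen to be the $Beta(a_0 + x, b_0 + n - x)$ density.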
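The calculation in Example 1.34 can be checked numerically. The following Python sketch, using only the standard library, evaluates the marginal density and then forms the posterior density as prior density times likelihood divided by marginal. The function names and the values n = 10, a0 = 2, b0 = 3 are illustrative assumptions, not part of the text.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_density(theta, a, b):
    """Beta(a, b) density with respect to Lebesgue measure on (0, 1)."""
    return math.exp((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
                    - log_beta(a, b))

def marginal_pmf(x, n, a0, b0):
    """Marginal density f_X(x) with respect to counting measure."""
    return math.comb(n, x) * math.exp(log_beta(a0 + x, b0 + n - x) - log_beta(a0, b0))

def posterior_density(theta, x, n, a0, b0):
    """Prior density times the Radon-Nikodym derivative
    C(n, x) theta^x (1 - theta)^(n - x) / f_X(x)."""
    likelihood = math.comb(n, x) * theta**x * (1 - theta)**(n - x)
    return beta_density(theta, a0, b0) * likelihood / marginal_pmf(x, n, a0, b0)

n, a0, b0 = 10, 2.0, 3.0  # illustrative values (assumptions)
total = sum(marginal_pmf(x, n, a0, b0) for x in range(n + 1))
print("marginal sums to", total)
print(posterior_density(0.3, 4, n, a0, b0), beta_density(0.3, a0 + 4, b0 + n - 4))
```

The two numbers on the last line should agree up to rounding, which is the conjugacy statement of the example.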

Example 1.35 (Continuation of Example 1.24; see page 14). For the case of
conditionally IID $N(\mu, \sigma^2)$ random variables, the posterior density with respect
to Lebesgue measure can be calculated by dividing the product of prior and likelihood
(1.26) by the prior predictive density (1.27). The result is easily seen to be
in the same form as the prior density (1.25) with the four constants $a_0, b_0, \mu_0, \lambda_0$
replaced by $a_1, b_1, \mu_1, \lambda_1$. In other words, the posterior distribution of $\Sigma^2$ is
$\Gamma^{-1}(a_1/2, b_1/2)$, and the conditional posterior of $M$ given $\Sigma = \sigma$ is $N(\mu_1, \sigma^2/\lambda_1)$.






Example 1.36. As an example in which Bayes' theorem does not apply, consider
the case in which the conditional distribution of $X$ given $\Theta = \theta$ is discrete with
$P_\theta(\{\theta - 1\}) = P_\theta(\{\theta + 1\}) = 1/2$. Suppose that $\Theta$ has a density $f_\Theta$ with respect
to Lebesgue measure. The $P_\theta$ distributions are not all absolutely continuous with
respect to a single $\sigma$-finite measure. It is still possible to verify that the posterior
distribution of $\Theta$ given $X = x$ is the discrete distribution with

$$\Pr(\Theta = x - 1 | X = x) = \frac{f_\Theta(x - 1)}{f_\Theta(x - 1) + f_\Theta(x + 1)},$$

and $\Pr(\Theta = x + 1 | X = x) = 1 - \Pr(\Theta = x - 1 | X = x)$. Note that the posterior
is not absolutely continuous with respect to the prior.
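The two-point posterior of Example 1.36 is simple to evaluate for any prior density. The sketch below is a minimal illustration; the function names and the standard normal prior density are hypothetical choices made only for this example.

```python
import math

def two_point_posterior(x, prior_density):
    """Return (Pr(Theta = x - 1 | X = x), Pr(Theta = x + 1 | X = x)) when,
    given Theta = theta, X equals theta - 1 or theta + 1 with probability 1/2 each."""
    lo = prior_density(x - 1.0)
    hi = prior_density(x + 1.0)
    total = lo + hi
    if total == 0.0:
        raise ValueError("prior density vanishes at both x - 1 and x + 1")
    return lo / total, hi / total

def std_normal_density(t):
    # Hypothetical prior density, chosen for illustration only.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

p_minus, p_plus = two_point_posterior(0.5, std_normal_density)
print(p_minus, p_plus)
```

Although the prior is continuous, the posterior puts all of its mass on the two points x - 1 and x + 1, which is why it cannot have a density with respect to the prior.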

The (posterior) predictive distribution of future data is defined in the
same way as the prior predictive distribution except that the posterior
distribution of $\Theta$ is used instead of the prior distribution of $\Theta$. For the case
of conditionally IID random variables with conditional density $f_{X_1|\Theta}$ given
$\Theta$, we have

$$f_{X_{n+1},\ldots,X_{n+k}|X_1,\ldots,X_n}(x_{n+1},\ldots,x_{n+k}|x_1,\ldots,x_n)
= \int_\Omega \prod_{i=n+1}^{n+k} f_{X_1|\Theta}(x_i|\theta)\,d\mu_{\Theta|X_1,\ldots,X_n}(\theta|x_1,\ldots,x_n). \qquad (1.37)$$

Example 1.38 (Continuation of Example 1.35; see page 17). The posterior predictive
distribution of future observations can be calculated after observing a
sample of conditionally IID normal random variables. Let $\bar{Y}_m$ be the average of
$m$ future observations. Since the posterior distribution of $\Theta$ is in the same form
as the prior (1.25) with $a_0, b_0, \mu_0, \lambda_0$ replaced by $a_1, b_1, \mu_1, \lambda_1$, it follows that the
posterior predictive distribution of $\bar{Y}_m$ is of the same form as the prior predictive
distribution. Using the result from the end of Example 1.24 on page 14, we get
that the posterior predictive distribution of $\bar{Y}_m$ is $t_{a_1}(\mu_1, \sqrt{[1/m + 1/\lambda_1]\,b_1/a_1})$.
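To make Example 1.38 concrete, the sketch below computes the updated constants and the parameters of the predictive t distribution. The update formulas are the standard conjugate ones for a prior in which M given Sigma = sigma is N(mu0, sigma^2/lam0) and Sigma^2 is inverse gamma with parameters (a0/2, b0/2). They are assumed here to agree with the formulas behind (1.26), which are not reproduced in this section. Function names and the data are hypothetical.

```python
import math

def nig_update(xs, a0, b0, mu0, lam0):
    """Return (a1, b1, mu1, lam1) after observing conditionally IID normal data xs.
    Assumes the standard conjugate update (see the lead-in)."""
    n = len(xs)
    xbar = sum(xs) / n
    w = sum((x - xbar) ** 2 for x in xs)      # within-sample sum of squares
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    a1 = a0 + n
    b1 = b0 + w + (n * lam0 / lam1) * (xbar - mu0) ** 2
    return a1, b1, mu1, lam1

def predictive_t_params(m, a1, b1, mu1, lam1):
    """Degrees of freedom, location, and scale of the posterior predictive
    t distribution of the average of m future observations."""
    scale = math.sqrt((1.0 / m + 1.0 / lam1) * b1 / a1)
    return a1, mu1, scale

a1, b1, mu1, lam1 = nig_update([1.0, 2.0, 3.0], a0=1.0, b0=1.0, mu0=0.0, lam0=1.0)
print(predictive_t_params(1, a1, b1, mu1, lam1))
```

Increasing m shrinks the scale toward sqrt(b1 / (a1 * lam1)), the part of the predictive spread that comes from posterior uncertainty about the mean rather than from sampling noise.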

To see how Bayes' theorem 1.31 applies to arbitrary random quantities 
whose distributions are modeled using parametric families, consider the 
following example. 

Example 1.39. Consider two sequences $\{X_n\}_{n=1}^\infty$ and $\{Y_n\}_{n=1}^\infty$ of random variables
that are each separately exchangeable. We can model them so that the
parameters are related. For example, suppose that the $X_i$ are IID $Exp(\theta)$ given
$\Theta = \theta$ and the $Y_j$ are IID $U(0, \theta)$ given $\Theta = \theta$, and we model the $X_i$ and $Y_j$
as conditionally independent given $\Theta$. We may learn $X_1 = x_1, \ldots, X_n = x_n$ and
then wish to make inference about the $Y_j$s. Let the prior for $\Theta$ be $\mu_\Theta$. The posterior
is

$$\mu_{\Theta|X}(B|x_1, \ldots, x_n) = \frac{\int_B \theta^n \exp(-x\theta)\,d\mu_\Theta(\theta)}{\int_\Omega \theta^n \exp(-x\theta)\,d\mu_\Theta(\theta)},$$



13 Another example of this situation occurs in Problem 47 on page 80. In that
example, $\Theta$ is an infinite-dimensional parameter, however.






where $x = \sum_{i=1}^n x_i$. The predictive density of $(Y_1, \ldots, Y_m)$ is

$$f_{Y_1,\ldots,Y_m|X_1,\ldots,X_n}(y_1, \ldots, y_m|x_1, \ldots, x_n) = \int_{\max y_i}^\infty \frac{1}{\theta^m}\,d\mu_{\Theta|X}(\theta|x_1, \ldots, x_n).$$
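The predictive density of Example 1.39 can be approximated numerically once a prior is fixed. The sketch below assumes, purely for illustration, a Gamma(a, b) prior for Theta (shape a, rate b), under which the posterior given the X data is Gamma(a + n, b + sum of the x_i). The integral over theta from max y_i to infinity is approximated by a midpoint rule on a truncated range. All names and numerical settings are hypothetical.

```python
import math

def gamma_pdf(theta, shape, rate):
    """Gamma density with the given shape and rate, evaluated at theta > 0."""
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(theta)
                    - rate * theta - math.lgamma(shape))

def predictive_density(ys, xs, a, b, upper=50.0, steps=100_000):
    """Approximate the predictive density of (Y_1, ..., Y_m) at ys.
    Assumes a hypothetical Gamma(a, b) prior; `upper` truncates the integral."""
    shape, rate = a + len(xs), b + sum(xs)   # posterior Gamma(a + n, b + x)
    m, c = len(ys), max(ys)                  # integrand is theta^(-m) on (c, infinity)
    h = (upper - c) / steps
    total = 0.0
    for i in range(steps):
        theta = c + (i + 0.5) * h            # midpoint of the i-th subinterval
        total += theta ** (-m) * gamma_pdf(theta, shape, rate)
    return total * h

value = predictive_density([0.3, 0.8], [0.5, 1.0, 1.5], a=2.0, b=1.0)
print(value)
```

Only max y_i enters the density, reflecting the fact that the largest observation carries all the information in the Y sample about the upper endpoint theta.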

Since $P_\theta$ is a conditional distribution given another random variable $\Theta =
\theta$, there exist conditional expectations given $\Theta = \theta$. Let $E_\theta$ stand for the
expectation operator under $P'_\theta$. That is, if $Z$ is a random variable with
finite absolute expectation, then $E_\theta(Z)$ means $E(Z|\Theta)(s)$ for all $s$ such
that $\Theta(s) = \theta$. By Theorem B.12, if $f : \mathcal{X} \to \mathbb{R}$ and $Z = f(X)$,

$$E_\theta(Z) = \int f(x)\,dP_\theta(x).$$

Similarly, let $Var_\theta(X)$ and $Cov_\theta(X, Y)$ stand for the conditional variance
of $X$ given $\Theta = \theta$ and the conditional covariance between $X$ and $Y$ given
$\Theta = \theta$, respectively.

There will be times when we wish to condition on other random variables
in addition to $\Theta$. Recall that for two random quantities $Z : S \to \mathbb{R}$ and
$Y : S \to T$, the conditional expectation of $Z$ given $Y$ was defined to be an
$\mathcal{A}_Y$ measurable function $E(Z|Y)(s)$ satisfying

$$E(Z I_B) = \int_B E(Z|Y)(s)\,d\mu(s),$$

for all $B \in \mathcal{A}_Y$. The conditional expectation of $Z$ given $Y$ and $\Theta$ will be
an $\mathcal{A}_{(Y,\Theta)}$ measurable function $E(Z|Y, \Theta)$ satisfying

$$E(Z I_B) = \int_B E(Z|Y, \Theta)(s)\,d\mu(s),$$

for all $B \in \mathcal{A}_{(Y,\Theta)}$. It follows from Theorem B.75 that $E(Z|Y = y, \Theta = \theta) =
E_\theta(Z|Y = y)$, where $E_\theta(\cdot|Y = y)$ is conditional expectation calculated from
$P'_\theta$. It follows from the law of total probability B.70 that

$$E(Z|Y = y) = \int E(Z|Y = y, \Theta = \theta)\,d\mu_{\Theta|Y}(\theta|y),$$

where $\mu_{\Theta|Y}(\cdot|y)$ specifies the conditional distribution of $\Theta$ given $Y = y$.

1.3.2 Improper Prior Distributions 

Two components are required to specify the distribution of a random quantity
X by means of a parametric family. One is the choice of parametric
family, and the other is the prior distribution over the parameter space.
Both of these must be specified if one is to have a marginal distribution for
X. Some people seem to think that choosing a prior distribution introduces
subjectivity into the analysis of data but choosing a parametric family does
not. These people are mistaken. Each choice one makes introduces subjectivity.

Philosophy aside, suppose that one finds it difficult to specify a prior 
distribution because one does not have much idea where the parameter is
likely to be located. In such cases, one may wish to do calculations based 
on a prior distribution that spreads the probability very thinly over the 
parameter space. A problem that often arises is that, if we take the limit 
as the probability is spread more thinly, the prior distribution ceases to 
satisfy the axioms of probability theory. 

Example 1.40. Suppose that we choose the parametric family of normal distributions
with variance 1 and parameter $\Theta$ equal to the mean. The parameter
space is the real line $\mathbb{R}$. Suppose that we want a normal prior distribution for $\Theta$,
but one with very high variance to indicate that we are not willing to say where
we think $\Theta$ is with much certainty. The distribution $N(a, n)$ for large $n$ has this
property. But how can we choose $n$? If we let $n \to \infty$, there is no countably
additive limit to the sequence of probability distributions. There is no normal
distribution with infinite variance.

What has become common in problems like Example 1.40 is to choose a
measure $\lambda$ on $(\Omega, \tau)$ which may not be a probability but still pretend that
it is the prior distribution of $\Theta$. That is, use $\lambda$ in place of $\mu_\Theta$ in Bayes'
theorem 1.31. The "posterior" after observing $X = x$, if it exists, will have
density with respect to $\lambda$,

$$\frac{f_{X|\Theta}(x|\theta)}{\int_\Omega f_{X|\Theta}(x|t)\,d\lambda(t)}. \qquad (1.41)$$

The key is whether or not the denominator in (1.41) is finite and nonzero.
If so, we can pretend that (1.41) is the posterior density of $\Theta$ given $X = x$
and then proceed with whatever analysis we want to perform. In this case,
we call $\lambda$ an improper prior distribution. If the denominator in (1.41) is 0
or infinite, one may need to choose another prior distribution.

Example 1.42. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. We can use $\lambda$ equal
to Lebesgue measure as an improper prior. Suppose that we observe only one
observation $X$. Since $f_{X|\Theta}(x|\theta) = (\sqrt{2\pi})^{-1} \exp(-[x - \theta]^2/2)$, it follows that the
posterior density with respect to Lebesgue measure derived from Bayes' theorem
1.31 is equal to $f_{X|\Theta}(x|\theta)$ as a function of $\theta$. In other words, given $X = x$,
$\Theta$ has $N(x, 1)$ distribution.
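One way to see the connection between Examples 1.40 and 1.42 is to watch the proper posterior under a normal prior with variance n approach N(x, 1) as n grows. The sketch below uses the standard normal-normal conjugate formulas; the function name, the prior mean 0, and the observation x = 1.7 are illustrative assumptions.

```python
def normal_posterior(x, prior_mean, prior_var):
    """Posterior mean and variance of Theta when X | Theta = theta ~ N(theta, 1)
    and Theta ~ N(prior_mean, prior_var), after one observation x."""
    precision = 1.0 / prior_var + 1.0          # prior precision + data precision
    mean = (prior_mean / prior_var + x) / precision
    return mean, 1.0 / precision

x = 1.7  # illustrative observation (assumption)
for n in (1, 10, 100, 10_000, 1_000_000):
    mean, var = normal_posterior(x, prior_mean=0.0, prior_var=float(n))
    print(f"prior variance {n:>9}: posterior N({mean:.6f}, {var:.6f})")
```

The priors themselves have no countably additive limit, but the posteriors do, and the limit matches the posterior obtained formally from Lebesgue measure.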

The above discussion of improper priors is not particularly precise math- 
ematically. There are two traditional ways to make the concept of improper 
prior mathematically precise. Each of them opens its own particular can 
of worms, so we will only describe each very briefly and point the reader 
to relevant literature. First, one may remove the restriction that the prob- 
ability of a set must be at most 1. Hartigan (1983) takes this approach 
and allows sets to have infinite probability. This makes improper priors 






"proper," but now many traditional theorems of probability theory which 
make implicit use of the upper bound on probabilities either must be re- 
proved or fail to apply to infinite probabilities. The second approach is that 
of DeFinetti (1974), in which the requirement that probabilities be count- 
ably additive is relaxed. That is, probability is only required to be finitely 
additive. 14 Needless to say, most of the traditional results of probability 
theory need to be reproved or scrapped in this theory also. 15 The improper 
prior in Example 1.42, when thought of as a finitely additive prior, gives 0 
probability to every compact set and still gives probability 1 to the whole 
real line. Hartigan (1983, Theorem 3.5) gives a version of Bayes' theorem 
for possibly infinite probabilities. Berti, Regazzini, and Rigo (1991) prove a 
Bayes' theorem for finitely additive probabilities, as do Heath and Sudderth 
(1989). An alternative to using improper priors is to do a robust Bayesian 
analysis, as described in Section 8.6.3. 



1.3.3 Choosing Probability Distributions 

We have assumed that probability distributions represent our (or someone 
else's) opinion about unknown quantities. At least a little thought should 
be given to how those probability distributions are chosen. The most com- 
mon method for choosing a probability distribution might be called "avail- 
ability." Most people who study statistics formally for no more than one 
academic year will only be able to describe one parametric family of dis- 
tributions suitable for use with continuous data. The family of normal 
distributions is both computationally tractable and remarkably versatile 
as a model for many natural phenomena. Its versatility is due in part to 
the fact that many other distributions can easily be transformed to normal 
distributions so that the computational tractability of the normal distribu- 
tion can be widely extended. The family of transformations introduced by 
Box and Cox (1964) is a classic example. Other methods for choosing prob- 
ability distributions are based on data analytic techniques. Either the very 
data on which inference will be based or other seemingly relevant data are 
analyzed by various graphical techniques, hypothesis tests, or other pro- 
cedures in order to try to select an appropriate probability model to use 
as a description of the uncertainty surrounding the data. The most direct 
methods for selecting distributions are those based on elicitation. In such 

14 A finitely additive probability is a function $\mu$ from a field $\mathcal{F}$ of subsets of a
set $S$ to $[0, 1]$ which satisfies $\mu(\emptyset) = 0$ and $\mu(A \cup B) = \mu(A) + \mu(B)$ if $A \cap B = \emptyset$.
Kadane, Schervish, and Seidenfeld (1985) explore some of the implications of
finitely additive probability for statistical inference. One well-known consequence
of using improper priors is the famous marginalization paradox reported by Stone
and Dawid (1972) and Dawid, Stone, and Zidek (1973).

15 Schervish, Seidenfeld, and Kadane (1984) show how the law of total probability
B.70 fails in the finitely additive theory. Stone (1976) gives an interesting
example of this failure.






methods, an expert (a term to be left undefined) is questioned about his or 
her beliefs concerning relevant random quantities, and a probability model 
for those beliefs is inferred from the responses. 

Each of the three types of methods for choosing probability distributions 
has its advantages and disadvantages. The availability method may seem 
silly as described above, but one usually does have a limited number of fam- 
ilies of distributions that one is willing to consider. The methods described 
in Section 8.6 on mixtures of models can be useful in sorting out uncertain- 
ties amongst alternative models for a given data set. In particular, robust 
methods (Sections 5.1.5 and 8.6.3) are designed to assess or even limit the 
sensitivity of inferences to specifications of distributions. 16 In short, one 
may not be forced actually to choose a single probability distribution to 
represent his or her uncertainty. Comparing the effects of various possible 
choices may be sufficient for assessing the information content of the data. 
When one is determined to use a particular parametric family of distribu- 
tions, and only the prior distribution for the parameter needs to be chosen, 
it may be the case that various alternatives make little difference and a 
choice by convenience (like an improper prior) will be sufficient. Whether 
or not this is true, considerations of Bayesian robustness will clearly be in 
order when such a choice is made. 

Data-based techniques are particularly appealing when one is forced to 
analyze someone else's data without access to subject matter expertise. 
Also, if one must use one of the popular computer packages, which tend 
to be built exclusively around only one distribution for each type of data, 
it pays to be able to transform the data into something better suited to 
be modeled by that one distribution. Quantile plots and various graphical 
techniques [see, for example, Gnanadesikan (1977)] are very useful for help- 
ing to select such a transformation. Likelihood-based methods for choosing 
a distribution can be described as follows. Suppose that there is an index
set $N$, and for each $\alpha \in N$, there is a possible distribution for the available
data $X$. Let $f_{X,\alpha}$ denote the predictive density of the data $X$ as calculated
in (1.23) under the assumption that the distribution being used is the one
corresponding to index $\alpha$. One could then base a choice amongst the different
values of $\alpha$ on the values of $g(\alpha) = f_{X,\alpha}(x)$ once $X = x$ is observed.
(Typically, one chooses $\alpha$ to maximize $g(\alpha)$.) This is similar to what is done
in empirical Bayes analysis (see Section 8.4 for more details). An obvious
drawback to all such data-based methods is that they tend to understate
the amount of uncertainty that remains about interesting unknown quantities.
The reason is that one pretends to be sure of something (e.g., which
parametric family or which value of $\alpha$) of which one really is not sure.

Example 1.43. Suppose that $X$ is a vector of 20 random variables and that we
cannot decide whether to model them as IID $Lap(\mu, \sigma/\sqrt{2})$ or IID $N(\mu, \sigma^2)$ given



16 Berger (1994) reviews the literature on robust Bayesian methods up 






$(M, \Sigma) = (\mu, \sigma)$. (In this way, $\mu$ and $\sigma$ are the mean and standard deviation in
both cases.) We could let $N$ be the set of all triples $(i, \mu, \sigma)$, where $i = 1$ means
Laplace and $i = 2$ means normal. Consider the following data values:

-0.0820, 1.3312, -1.3518, -1.4930, 0.0850, 0.7022, 1.735, 
-0.3164, 2.1948, -0.0371, 0.3377, -0.3124, 0.6087, 0.7339, 
-0.4632, 0.3398, -0.0352, 0.1597, -0.6344, -0.4435. 

The value of $\alpha$ that leads to the largest $f_{X,\alpha}(x)$ is $(1, 0.0249, 0.9473)$. The largest
value is $5.943 \times 10^{-12}$, which is only slightly larger than the value achieved at
$\alpha = (2, 0.1530, 0.8909)$, namely $4.772 \times 10^{-12}$. If we decide to use the Laplace
distribution model, we will be pretending that we were sure from the start that
the data would be $Lap(\mu, \sigma/\sqrt{2})$, rather than taking into account the sizable
amount of uncertainty that still remains about the underlying distribution.

In a classical setting, one might look at quantile plots to see whether the 
data looked more normal or more like a Laplace distribution. Figure 1.44 shows 
quantile plots for both Laplace and normal distributions. The two plots are about 
equally straight, although the Laplace plot is a little bit straighter. Choosing 
either distribution would surely be acting as if we knew something that was 
quite uncertain. 
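The maximization in Example 1.43 can be reproduced with a few lines of Python. Because each index (i, mu, sigma) fixes the distribution completely, g reduces to the likelihood, and the maximizers are the usual maximum likelihood estimates: the sample mean and root mean squared deviation for the normal family, and the sample median and mean absolute deviation for the Laplace family. The function names are hypothetical, and the Laplace density is taken to be exp(-|x - mu|/b)/(2b) with b = sigma/sqrt(2).

```python
import math
import statistics

data = [-0.0820, 1.3312, -1.3518, -1.4930, 0.0850, 0.7022, 1.735,
        -0.3164, 2.1948, -0.0371, 0.3377, -0.3124, 0.6087, 0.7339,
        -0.4632, 0.3398, -0.0352, 0.1597, -0.6344, -0.4435]

def normal_max_likelihood(xs):
    """Return (mu, sigma, likelihood) at the normal maximum likelihood estimate."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    loglik = -n * math.log(sigma * math.sqrt(2 * math.pi)) - n / 2
    return mu, sigma, math.exp(loglik)

def laplace_max_likelihood(xs):
    """Return (mu, sigma, likelihood) at the Laplace maximum likelihood estimate,
    reported on the standard deviation scale sigma = b * sqrt(2)."""
    n = len(xs)
    mu = statistics.median(xs)
    b = sum(abs(x - mu) for x in xs) / n
    loglik = -n * math.log(2 * b) - n
    return mu, b * math.sqrt(2), math.exp(loglik)

mu_l, sigma_l, lik_l = laplace_max_likelihood(data)
mu_n, sigma_n, lik_n = normal_max_likelihood(data)
print(f"Laplace: mu={mu_l:.4f} sigma={sigma_l:.4f} likelihood={lik_l:.3e}")
print(f"Normal:  mu={mu_n:.4f} sigma={sigma_n:.4f} likelihood={lik_n:.3e}")
```

The ratio of the two maximized likelihoods is only about 1.25, which is the quantitative content of the remark that the data do not clearly favor either family.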

Elicitation techniques tend to lie on the interface between statistical the- 
ory and psychology. A series of questions must be designed for interrogation 
of the expert. The responses to these questions must then be reconciled 
with the axioms of probability theory, keeping in mind the expert's limited 
motivation and/or ability to answer accurately. Much has been written in 
the psychological literature about the ability of people to assess probabili- 
ties subjectively. Kahneman, Slovic, and Tversky (1982) have compiled an 
interesting collection of articles that, among other things, illustrate and 





[Figure 1.44. Quantile Plots for Laplace and Normal Distributions. Two panels, labeled Laplace and Normal; horizontal axis: Quantiles of Distribution.]


describe various problems people have in quantifying uncertainty. Hogarth 
(1975) surveys the early literature on the assessment of subjective distribu- 
tions. He concludes that humans are "ill-equipped for assessing subjective 
probability distributions." This conclusion might help to explain the num- 
ber of tools that have emerged since that time to better equip statisticians 
and experts to choose distributions. These tools are directed primarily to- 
ward choosing a prior distribution for the parameters of a prespecified 
parametric family. For example, Kadane et al. (1980) and Garthwaite and 
Dickey (1988) give algorithms for specifying conjugate prior distributions 
for the parameters of normal linear models. Garthwaite and Dickey (1992) 
extend their method to deal with the selection of variables in multiple re- 
gression. Freedman and Spiegelhalter (1983) and Chaloner et al. (1993) 
describe methods for eliciting prior information for use in clinical trials. 
A common feature of most prior elicitation schemes is their reliance on 
the predictive distribution (1.23) to infer the prior. The reason for this is 
that experts are more likely to be comfortable thinking about the actual 
observables of their study rather than parameters of statistical models. 
Problems 18 and 19 at the end of this chapter give some simple examples 
of how this might be done. In order to take account of the fact that experts 
may not accurately respond to the elicitation inquiries, Dickey (1980) and 
later Gavasakar (1984) described probability models for the elicitation pro- 
cess itself. In these models, the responses U which the expert will give to 
elicitation questions are modeled as data with a distribution that depends 
on the subjective distribution P being elicited. One then tries to infer P 
from U using Bayesian or related methods. One must be careful not only to 
consider the possible errors in U as answers to the elicitation questions, but 
also to consider how sensitive the inference from U to P is. For example, 
does a small change in U produce a small or a large change in P? (See the
last parts of Problem 18 on page 76.) 

In the remainder of this text, as in most texts on the theory of statistics, 
little attention will be given to how the probability distributions are chosen. 
When a prior distribution is used for a specified parametric family, one can 
assume either that the prior was elicited by some method or other, or that 
the prior was chosen by convenience (a popular device), or that the prior 
is just one of many that will later be compared in a robustness study. 

1.4 DeFinetti's Representation Theorem 
1.4.1 Understanding the Theorems 

In this section we will state some representation theorems for exchangeable
random quantities and give a number of examples. The proofs of
these theorems (and some related results of interest) are deferred to Section
1.5. 17 These theorems characterize all of the possible joint distributions
for exchangeable random quantities which take values in a Borel space
(think finite-dimensional Euclidean space), and they are essentially due to
DeFinetti (1937). They can be summarized here as follows. If there is an
infinite sequence of exchangeable random quantities $\{X_n\}_{n=1}^\infty$, then there
must be some random quantity $\mathbf{P}$ such that the $X_i$ are conditionally IID
given $\mathbf{P}$. If the random quantities have Bernoulli distribution, then $\mathbf{P}$ can
be taken to be the limit of the proportion of successes in the first $n$ observations.
In general, $\mathbf{P}$ will turn out to be the limit of the empirical
distributions $\mathbf{P}_n$ of $X_1, \ldots, X_n$, which were defined in (1.21). This explains
how DeFinetti's theorem helps to motivate models like (1.10).

Example 1.45. Consider the case of Bernoulli random variables $\{X_i\}_{i=1}^\infty$. Here,
$\mathcal{X} = \{0, 1\}$ and a random probability measure $\mathbf{P}$ on $\mathcal{X}$ is equivalent to a random
variable $\Theta \in [0, 1]$, where $\Theta = \mathbf{P}(\{1\})$. The empirical distribution $\mathbf{P}_n$ is equivalent
to $\bar{X}_n$, the average of the first $n$ observations, since $\mathbf{P}_n(\{1\}) = \bar{X}_n$. Theorem 1.47
(a special case of Theorem 1.49) will say that $\bar{X}_n$ converges to $\Theta$ a.s., and that,
conditional on $\Theta = \theta$, the $X_i$ are IID $Ber(\theta)$ random variables. This is what we
meant on page 7 when we said that "$\theta$ is given an implicit meaning as a random
variable $\Theta$, rather than a fixed value." Also, the random variable $\Theta$ will have a
distribution, which is the measure $\mu$ in (1.10).

The heavy mathematics in the proof of Theorem 1.49 is required to make 
precise what it means to have a random probability measure P and what it 
means to condition on such a thing. For random quantities that assume only 
finitely many different values, random probability measures are equivalent 
to finite-dimensional random vectors. For more general random quantities, 
random probability measures can be more complicated. For this reason, we 
prove Theorem 1.47 first, even though it is a special case of Theorem 1.49. 
The proof of Theorem 1.47 contains the essential ideas of the more com- 
plicated proof without being encumbered by so much mathematics. 

If there are only finitely many exchangeable quantities $X_1, \ldots, X_N$, then
all that we can prove is the following. Conditional on the empirical distribution
$\mathbf{P}_N$ of $X_1, \ldots, X_N$, every ordered $n$-tuple (for $n \le N$) of the $X_i$ has
the distribution of $n$ draws without replacement from a finite population
with distribution $\mathbf{P}_N$. It is the "without replacement" qualifier that prevents
us from proving that $X_1, \ldots, X_N$ are conditionally independent. (See
Example 1.18 on page 10.) It is possible for a finite collection of exchangeable
random variables to be conditionally independent; however, it is not
necessary. Looking at the Bernoulli case first might aid in understanding

17 In most of this text, proofs are given immediately or almost immediately
after the statements of results. Because DeFinetti's representation theorem 1.49
is so important for motivating statistical modeling, and because its proof involves
some rather heavy mathematics, many readers may wish to forgo reading the
proofs on a first pass through this material. However, every reader should at least
try to understand what Theorem 1.49 says.






the finite case theorem. 

Example 1.46. Let $X_1, \ldots, X_N$ be exchangeable Bernoulli random variables.
Let the word "success" stand for $X_i = 1$, and let the word "failure" stand for
$X_i = 0$. Let the word "trial" stand for one of the $X_i$. Since there are only
finitely many ($2^N$) possible values for the vector $(X_1, \ldots, X_N)$, the entire joint
distribution can be specified by giving probabilities to all of those $2^N$ vectors
of 0s and 1s. If $X_1, \ldots, X_N$ are exchangeable, however, many of the vectors will
have the same probability. For example, the $N$ vectors with exactly one success
and $N - 1$ failures all have the same probability. Similarly, the vectors with
exactly two successes and $N - 2$ failures all have the same probability. In fact,
for each $m = 0, \ldots, N$, all $\binom{N}{m}$ vectors with exactly $m$ successes and $N - m$
failures have the same probability. Since the total number of successes in all
$N$ trials plays such an important role in the distribution, we give it a name,
$M$. Let $p_m = \Pr(M = m)$ for $m = 0, \ldots, N$. Then, the probability of each
vector with exactly $m$ successes and $N - m$ failures is $p_m/\binom{N}{m}$. All probabilities
associated with the joint distribution of $X_1, \ldots, X_N$ can be calculated from these
values. For example, suppose that we let $K$ equal the number of successes in
a particular collection of $n$ trials (for example, the first $n$, or the last $n$, or
every other one from the first $2n$, etc.). Then $\Pr(K = k)$ can be calculated
by adding up all the probabilities of the vectors for which the particular $n$ trials
include exactly $k$ successes. This is nothing more than a straightforward counting
argument, if we first partition the vectors according to the value of $M$. For each
$m = k, \ldots, N - n + k$, there are $\binom{n}{k}\binom{N-n}{m-k}$ vectors with $M = m$ and with exactly
$k$ successes on the particular $n$ trials of interest. It follows that

$$\Pr(K = k) = \sum_{m=k}^{N-n+k} \frac{\binom{n}{k}\binom{N-n}{m-k}}{\binom{N}{m}}\,p_m.$$

This last expression is easily recognized as a mixture of hypergeometric probabilities
$Hyp(N, n, m)$ with mixing weights $p_m$. That is, it appears as if $K$
has $Hyp(N, n, m)$ distribution conditional on $M = m$ and $M$ has distribution
$(p_0, \ldots, p_N)$. Note that the $Hyp(N, n, m)$ distribution is the distribution of the
number of successes in $n$ draws without replacement from an urn containing $m$
successes and $N - m$ failures. Also, the random variable $M$ is equivalent to the
empirical distribution $\mathbf{P}_N$.
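The counting argument of Example 1.46 translates directly into code. The sketch below evaluates the mixture of hypergeometric probabilities exactly with rational arithmetic; the function name and the mixing weights (all mass split between M = 1 and M = 3 with N = 4) are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb

def prob_k_successes(k, n, N, p):
    """Pr(K = k) for a fixed collection of n of the N exchangeable Bernoulli
    trials, where p[m] = Pr(M = m) for the total number of successes M."""
    total = Fraction(0)
    for m in range(k, N - n + k + 1):
        # Number of vectors with M = m and k successes on the n chosen trials,
        # divided by the number of vectors with M = m.
        weight = Fraction(comb(n, k) * comb(N - n, m - k), comb(N, m))
        total += weight * p[m]
    return total

# Hypothetical distribution of M for N = 4.
p = [Fraction(0), Fraction(1, 2), Fraction(0), Fraction(1, 2), Fraction(0)]
dist = [prob_k_successes(k, 2, 4, p) for k in range(3)]
print(dist)
```

Any probability vector of length N + 1 can be passed as p, which reflects the fact that every distribution for M yields an exchangeable joint distribution for the N trials.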

Example 1.46 above is the proof of the finite version of DeFinetti's theorem
for Bernoulli random variables. It is also illustrative of the most general
form of the finite version. The total number of successes must be replaced
by the empirical distribution $\mathbf{P}_N$, and then we have that $X_1, \ldots, X_N$ are
exchangeable if and only if the conditional distribution, given $\mathbf{P}_N$, of every
finite subcollection $X_{i_1}, \ldots, X_{i_n}$ is the distribution of $n$ random draws
without replacement from a population with distribution $\mathbf{P}_N$.

1.4.2 The Mathematical Statements 
The Bernoulli case is simple enough to state without introduction. 
Theorem 1.47 (DeFinetti's representation theorem for Bernoulli
random variables). An infinite sequence $\{X_n\}_{n=1}^\infty$ of Bernoulli random
variables is exchangeable if and only if there is a random variable $\Theta$ taking
values in $[0, 1]$ such that, conditional on $\Theta = \theta$, $\{X_n\}_{n=1}^\infty$ are IID $Ber(\theta)$.
Furthermore, if the sequence is exchangeable, then the distribution of $\Theta$ is
unique and $\sum_{i=1}^n X_i/n$ converges to $\Theta$ almost surely.
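The "if" direction of Theorem 1.47 is easy to simulate: draw a value of Theta, then generate conditionally IID Bernoulli variables, and compare the sample average with the drawn value. The sketch below assumes a Beta(2, 2) distribution for Theta and a fixed seed, both of which are arbitrary illustrative choices; it illustrates the convergence claim and proves nothing.

```python
import random

def simulate_exchangeable_bernoulli(n, a, b, rng):
    """Draw Theta ~ Beta(a, b), then n conditionally IID Ber(Theta) variables."""
    theta = rng.betavariate(a, b)
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, xs

rng = random.Random(2024)  # fixed seed so that the run is reproducible
theta, xs = simulate_exchangeable_bernoulli(50_000, 2.0, 2.0, rng)
sample_mean = sum(xs) / len(xs)
print(f"theta = {theta:.4f}, sample mean = {sample_mean:.4f}")
```

Rerunning with a different seed gives a different limit, which is the sense in which the limit of the averages is itself a random variable.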

For the more general cases, we need some more notation. Let $(\mathcal{X}, \mathcal{B})$ be
a Borel space, and let $\mathcal{P}$ be the set of all probability measures on $(\mathcal{X}, \mathcal{B})$.
The theorems stated below will give the conditional distributions of certain
random quantities taking values in $\mathcal{X}$ given certain probability measures.
To be mathematically precise, these probability measures must themselves
be random quantities. That is, we will need a $\sigma$-field $\mathcal{C}_\mathcal{P}$ of subsets of
$\mathcal{P}$ such that the appropriate probability measures can be thought of as
measurable functions from some probability space $(S, \mathcal{A}, \mu)$ to $(\mathcal{P}, \mathcal{C}_\mathcal{P})$. Let
$\mathcal{C}_\mathcal{P}$ be the smallest $\sigma$-field of subsets of $\mathcal{P}$ containing all sets of the form
$A_{B,t} = \{P \in \mathcal{P} : P(B) \le t\}$, for $B \in \mathcal{B}$ and $t \in [0, 1]$. This is the smallest
$\sigma$-field for which the evaluation functions $g_B : \mathcal{P} \to \mathbb{R}$ are measurable, 18
where $g_B(P) = P(B)$. It is easy to show that $\mathbf{P}_n$, defined in (1.21), is a
measurable function from the $n$-fold product space $(\mathcal{X}^n, \mathcal{B}^n)$ to $(\mathcal{P}, \mathcal{C}_\mathcal{P})$ (see
Problem 24 on page 77). If $(S, \mathcal{A}, \mu)$ is a probability space, a measurable
function $\mathbf{P} : S \to \mathcal{P}$ is called a random probability measure. In this way, $\mathbf{P}_n$
is a random probability measure for every $n$.
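The empirical distribution is a concrete instance of a probability measure that depends on the data. The sketch below represents it as a function that assigns to each set B the fraction of observations falling in B. Representing sets by indicator functions is a simplification chosen for this illustration, and the names are hypothetical.

```python
from fractions import Fraction

def empirical_measure(xs):
    """Return P_n as a function: P_n(B) is the fraction of the x_i that lie in B,
    where B is passed as an indicator function (an assumption of this sketch)."""
    n = len(xs)
    def P_n(indicator):
        return Fraction(sum(1 for x in xs if indicator(x)), n)
    return P_n

P5 = empirical_measure([0.2, 1.5, -0.7, 3.1, 0.9])
print(P5(lambda x: x > 0))        # mass on the positive half-line
print(P5(lambda x: 0 < x <= 1))   # mass on the interval (0, 1]
```

Different data vectors give different measures, so composing this construction with random data yields a random probability measure in the sense just defined.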

1.4.2.1 The Finite Version 

In order to state the finite version of DeFinetti's theorem, we will find it
convenient to refer to random samples from the empirical distribution of a
collection of random variables $X_1, \ldots, X_N$. What we mean by this is the
following. Suppose that $X_i = x_i$ for $i = 1, \ldots, N$. Create an urn with $N$
balls labeled $x_1, \ldots, x_N$. A simple random sample of size $n$ with/without
replacement from the empirical distribution $\mathbf{P}_N$ of $X_1, \ldots, X_N$ is $n$ draws
with/without replacement from this urn such that, on each draw, every
ball in the urn has equal probability of being drawn.

Theorem 1.48. Suppose that $X_1, \ldots, X_N$ are random quantities taking
values in a Borel space $(\mathcal{X}, \mathcal{B})$. Let $X = (X_1, \ldots, X_N)$, and for each $B \in
\mathcal{B}$, let $\mathbf{P}_N(B) = \sum_{i=1}^N I_B(X_i)/N$ be the empirical distribution of $X$. The
random quantities are exchangeable if and only if, for every ordered $n$-tuple

18 Those familiar with topological concepts will recognize the sets $A_{B,t}$ as a
subbase for the topology of pointwise convergence of functions from $\mathcal{B}$ to $\mathbb{R}$,
which is also the product topology when that set of functions is considered as the
product space $\mathbb{R}^{\mathcal{B}}$. As such, $(\mathcal{P}, \mathcal{C}_\mathcal{P})$ is not a Borel space. This inconvenient
circumstance will not cause problems for us, however. One of the steps in the proof
of Theorem 1.49 is to show that the subset of $\mathcal{P}$ in which $\mathbf{P}$ lies is the image
of a Borel space $(\mathcal{X}^\infty, \mathcal{B}^\infty)$ under a measurable function. Hence regular conditional
distributions are induced on $(\mathcal{P}, \mathcal{C}_\mathcal{P})$ by the corresponding distributions in
$(\mathcal{X}^\infty, \mathcal{B}^\infty)$.






$(i_1, \ldots, i_n)$ of distinct elements of $\{1, \ldots, N\}$, the joint distribution of
$(X_{i_1}, \ldots, X_{i_n})$, conditional on $\mathbf{P}_N = P$ is that of a simple random sample
without replacement from the distribution $P$.
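A small exact computation shows what "simple random sample without replacement" implies for the joint distribution: the draws are exchangeable but, in general, not independent. The sketch below enumerates all ordered draws from a hypothetical three-ball urn; the function name and the urn contents are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def ordered_draw_distribution(urn, n):
    """Exact joint distribution of n ordered draws without replacement,
    with every remaining ball equally likely on each draw."""
    counts = Counter(tuple(urn[i] for i in idx)
                     for idx in permutations(range(len(urn)), n))
    total = sum(counts.values())
    return {draw: Fraction(c, total) for draw, c in counts.items()}

urn = [1, 1, 0]  # hypothetical urn: balls labeled x_1 = 1, x_2 = 1, x_3 = 0
joint = ordered_draw_distribution(urn, 2)
p_first_is_1 = sum(p for draw, p in joint.items() if draw[0] == 1)
print(joint)
print(p_first_is_1 ** 2, joint[(1, 1)])  # product of marginals vs. joint probability
```

Swapping the two coordinates leaves the joint distribution unchanged, while the last line shows that the joint probability of two 1s differs from the product of the marginals.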

1.4.2.2 The Infinite Version 

The infinite version of DeFinetti's theorem is the following. 

Theorem 1.49 (DeFinetti's representation theorem). Let $(S, \mathcal{A}, \mu)$
be a probability space, and let $(\mathcal{X}, \mathcal{B})$ be a Borel space. For each $n$, let
$X_n : S \to \mathcal{X}$ be measurable. The sequence $\{X_n\}_{n=1}^\infty$ is exchangeable if
and only if there is a random probability measure $\mathbf{P}$ on $(\mathcal{X}, \mathcal{B})$ such that,
conditional on $\mathbf{P} = P$, $\{X_n\}_{n=1}^\infty$ are IID with distribution $P$. Furthermore,
if the sequence is exchangeable, then the distribution of $\mathbf{P}$ is unique, and
$\mathbf{P}_n(B)$ converges to $\mathbf{P}(B)$ almost surely for each $B \in \mathcal{B}$.

In Section 2.4, we present a more general theorem of Diaconis and Freedman 
(1984) and Lauritzen (1984, 1988), which applies to sequences of random 
quantities that are not exchangeable. This Theorem 2.111 will actually 
imply Theorem 1.49 as a special case, but its proof is far more complicated 
than the proof of Theorem 1.49, which we give in Section 1.5. 



1.4.3 Some Examples 

Here, we present more examples of exchangeable sequences and the
implications of DeFinetti's theorem.



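Example 1.50. Suppose that $X_1, \ldots, X_N$ are IID $Ber(r)$ random variables. If
$M = \sum_{i=1}^N X_i$, then $\Pr(M = m) = \binom{N}{m} r^m (1-r)^{N-m}$ for $m = 0, \ldots, N$. That
is, $M$ has $Bin(N, r)$ distribution. Theorem 1.48 says that

$$\Pr(X_1 = 1, \ldots, X_k = 1, X_{k+1} = 0, \ldots, X_n = 0) = \sum_{m=k}^{N-n+k} \frac{\binom{N-n}{m-k}}{\binom{N}{m}} \binom{N}{m} r^m (1-r)^{N-m}$$
$$= \sum_{\ell=0}^{N-n} \binom{N-n}{\ell} r^{\ell+k} (1-r)^{N-\ell-k}$$
$$= r^k (1-r)^{n-k} \sum_{\ell=0}^{N-n} \binom{N-n}{\ell} r^{\ell} (1-r)^{N-n-\ell}$$
$$= r^k (1-r)^{n-k},$$

which corresponds to the $X_i$ being IID $Ber(r)$. Hence, the probability of observing
$k$ ones in $n$ trials is $\binom{n}{k} r^k (1-r)^{n-k}$, the binomial probability.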
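The calculation in Example 1.50 can be checked with exact rational arithmetic. The following is only a sketch: the function names `prob_k_in_n` and `binomial_pmf` are illustrative and not part of the text. It mixes the hypergeometric probabilities of sampling without replacement over the $Bin(N, r)$ distribution of $M$ and compares the result with the binomial probability of $k$ ones in $n$ trials.

```python
from fractions import Fraction
from math import comb


def prob_k_in_n(N, n, k, r):
    """Pr(k ones among the first n of N IID Ber(r) trials), obtained by
    mixing hypergeometric probabilities over M ~ Bin(N, r)."""
    total = Fraction(0)
    for m in range(k, N - n + k + 1):
        # Pr(K = k | M = m): n draws without replacement from an urn
        # holding m ones and N - m zeros.
        hyper = Fraction(comb(n, k) * comb(N - n, m - k), comb(N, m))
        # Pr(M = m) under the Bin(N, r) distribution.
        pr_m = comb(N, m) * r**m * (1 - r) ** (N - m)
        total += hyper * pr_m
    return total


def binomial_pmf(n, k, r):
    """Pr(k ones in n IID Ber(r) trials)."""
    return comb(n, k) * r**k * (1 - r) ** (n - k)
```

With `r` given as a `Fraction`, the two functions agree exactly for every admissible `N`, `n`, and `k`, which is the content of the example.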

The following example helps to explain why Theorem 1.48 is not used 
very often with random variables having continuous distribution. 






Example 1.51. Suppose that $X_1, \ldots, X_N$ are exchangeable with continuous
joint CDF $F_{X_1,\ldots,X_N}$. The conditional distribution of $X_1, \ldots, X_n$ given $\mathbf{P}_N$ is
that of a simple random sample without replacement from $\mathbf{P}_N$, but the
distribution of $\mathbf{P}_N$ is not simple. If we let $B_1, \ldots, B_k$ be a partition of $\mathcal{X}$, the joint
distribution of $V = (\mathbf{P}_N(B_1), \ldots, \mathbf{P}_N(B_k))$ can be expressed formally as follows.
For each vector $(i_1, \ldots, i_k)$ such that $\sum_{j=1}^k i_j = N$,

$$\Pr\left(V = \left(\frac{i_1}{N}, \ldots, \frac{i_k}{N}\right)\right) = \int_A dF_{X_1,\ldots,X_N}(x_1, \ldots, x_N),$$

where $A$ is the union of the $\binom{N}{i_1, \ldots, i_k}$ product sets of the form $B_{s_1} \times \cdots \times B_{s_N}$,
where the subscripts $s_1, \ldots, s_N$ are integers from 1 to $k$ with $j$ appearing $i_j$ times
for each $j$. Needless to say, this formulation will not get us very far in general.

Example 1.52. This example is due to Bayes (1764). Suppose that $\{X_n\}_{n=1}^\infty$ are
exchangeable Bernoulli random variables, and we set

$$\Pr(k \text{ successes in } n \text{ trials}) = \frac{1}{n+1}, \quad \text{for } k = 0, \ldots, n \text{ and } n = 1, 2, \ldots.$$

To check that this gives a consistent set of probabilities, we must show that, for
every $n$ and every $n$-tuple $(x_1, \ldots, x_n)$ of elements of $\{0, 1\}$,

$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_1 = x_1, \ldots, X_n = x_n, X_{n+1} = 0) + \Pr(X_1 = x_1, \ldots, X_n = x_n, X_{n+1} = 1).$$

To show this, let $k = \sum_{i=1}^n x_i$. Then, the left-hand side equals $1/[(n+1)\binom{n}{k}]$.
The right-hand side equals

$$\frac{1}{(n+2)\binom{n+1}{k}} + \frac{1}{(n+2)\binom{n+1}{k+1}} = \frac{1}{(n+1)\binom{n}{k}}.$$

To figure out what $\mathbf{P}$ is, recall that $\mathbf{P}_n(\{1\}) = \bar{X}_n$, the proportion of successes
in the first $n$ trials, and $\lim_{n\to\infty} \mathbf{P}_n(\{1\}) = \mathbf{P}(\{1\})$. Since $\bar{X}_n$ converges a.s. to
$\Theta$, it converges in distribution to $\Theta$ by Theorem B.90. Let $F_n(t) = \Pr(\bar{X}_n \le t)$
be the CDF of $\bar{X}_n$. Write

$$F_n(t) = \Pr(\text{at most } nt \text{ successes in } n \text{ trials}) = \frac{\lfloor nt \rfloor + 1}{n+1},$$

where $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. It is trivial to see
that $\lim_{n\to\infty} (\lfloor nt \rfloor + 1)/(n+1) = t$, for all $0 \le t \le 1$. Hence, $F(t) = t$ is the CDF
of $\Theta = \lim_{n\to\infty} \bar{X}_n$. That is, the $X_i$ are conditionally IID $Ber(\theta)$ given $\Theta = \theta$
and $\Theta$ has $U(0, 1)$ distribution.
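Both steps of Example 1.52 are easy to check numerically. The sketch below uses hypothetical helper names (`seq_prob`, `is_consistent`, `cdf_of_mean`). It verifies the consistency condition for the assignment $\Pr(k \text{ successes in } n \text{ trials}) = 1/(n+1)$ and evaluates the CDF $F_n(t) = (\lfloor nt \rfloor + 1)/(n+1)$ of the sample proportion, which approaches $t$ as $n$ grows.

```python
from fractions import Fraction
from math import comb, floor


def seq_prob(n, k):
    """Probability of one particular 0/1 sequence of length n containing
    k ones, when Pr(k successes in n trials) = 1/(n+1)."""
    return Fraction(1, (n + 1) * comb(n, k))


def is_consistent(n):
    """Check Pr(x_1..x_n) = Pr(x_1..x_n, 0) + Pr(x_1..x_n, 1) for every k."""
    return all(
        seq_prob(n, k) == seq_prob(n + 1, k) + seq_prob(n + 1, k + 1)
        for k in range(n + 1)
    )


def cdf_of_mean(n, t):
    """F_n(t) = Pr(proportion of successes in n trials <= t), 0 <= t <= 1."""
    return Fraction(floor(n * t) + 1, n + 1)
```

Passing `t` as a `Fraction` keeps the floor computation exact; for large `n` the value of `cdf_of_mean(n, t)` is within about `1/n` of `t`, in line with the uniform limit.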

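Example 1.53. Suppose that, for each $n$, the joint density of $X_1, \ldots, X_n$ is

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \frac{\Gamma(a+n)\, b^a}{\Gamma(a)\left(b + \sum_{i=1}^n x_i\right)^{a+n}}, \quad \text{for all } x_i > 0.$$

Clearly such random variables are exchangeable for each $n$. It can be seen that
these densities are consistent also. (See Problem 20 on page 76.) Let us try to
find the distribution of the limit of the empirical probability measures, $\mathbf{P}_n$.

$$\Pr(\mathbf{P}_n((-\infty, c]) \le t) = \Pr(\text{at most } \lfloor tn \rfloor\ X_i \text{ are} \le c)$$
$$= \sum_{k=0}^{\ell} \Pr(\text{exactly } k\ X_i \text{ are} \le c)$$
$$= \sum_{k=0}^{\ell} \binom{n}{k} \Pr(X_1, \ldots, X_k \le c,\ X_{k+1}, \ldots, X_n > c),$$

where $\ell = \lfloor tn \rfloor$. The probability that the first $k$ $X_i$ are at most $c$ while the rest
are greater is

$$\int_0^c \cdots \int_0^c \int_c^\infty \cdots \int_c^\infty \frac{\Gamma(a+n)\, b^a}{\Gamma(a)\left(b + \sum_{i=1}^n x_i\right)^{a+n}}\, dx_n \cdots dx_1$$
$$= \sum_{j=0}^{k} (-1)^j \binom{k}{j} \left(\frac{b}{b + (n-k+j)c}\right)^a$$
$$= \sum_{j=0}^{k} (-1)^j \binom{k}{j} \int_0^\infty \frac{b^a}{\Gamma(a)} z^{a-1} \exp(-z[b + (n-k+j)c])\, dz$$
$$= \int_0^\infty \frac{b^a}{\Gamma(a)} z^{a-1} \exp(-z[b + cn])\, [\exp(cz) - 1]^k\, dz.$$

Multiplying this last expression by $\binom{n}{k}$ and summing over $k = 0, \ldots, \ell$ give

$$\int_0^\infty \frac{b^a}{\Gamma(a)} z^{a-1} \exp(-bz) \Pr(Y_{n,z} \le \ell)\, dz,$$

where $Y_{n,z}$ is a random variable with $Bin(n, 1 - \exp[-cz])$ distribution. For each $z$,

$$\lim_{n\to\infty} \Pr(Y_{n,z} \le \ell) = \begin{cases} 1 & \text{if } 1 - \exp(-cz) < t \\ 0 & \text{if } 1 - \exp(-cz) > t \end{cases} = I_{(0,\, -\log(1-t)/c)}(z).$$

So, the CDF of the limit $\mathbf{P}((-\infty, c])$ of $\mathbf{P}_n((-\infty, c])$ is

$$\int_0^{-\log(1-t)/c} \frac{b^a}{\Gamma(a)} z^{a-1} \exp(-bz)\, dz,$$

which is the CDF of $1 - \exp(-c\Theta)$, where $\Theta \sim \Gamma(a, b)$. That is, it is as if there
were a random variable $\Theta \sim \Gamma(a, b)$ and $\mathbf{P}((-\infty, c]) = 1 - \exp(-c\Theta)$. Put another
way, it is as if the $X_i$ were conditionally IID with $Exp(\theta)$ distribution given $\Theta = \theta$
and $\Theta \sim \Gamma(a, b)$. That this is indeed the case can be proven. (See Problem 22 on
page 77.)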
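The chain of equalities in Example 1.53 can be spot-checked numerically. The sketch below is not from the text: the function names, the midpoint rule, and the truncation point `upper` are illustrative choices. It compares the alternating sum $\sum_{j=0}^{k} (-1)^j \binom{k}{j} (b/[b + (n-k+j)c])^a$ with the mixture integral $\int_0^\infty \frac{b^a}{\Gamma(a)} z^{a-1} e^{-bz} (1 - e^{-cz})^k e^{-cz(n-k)}\, dz$, which is the same quantity written as a $\Gamma(a, b)$ mixture of IID $Exp(z)$ probabilities.

```python
from math import comb, exp, gamma


def alt_sum(n, k, c, a, b):
    """sum_{j=0}^k (-1)^j C(k, j) (b / (b + (n - k + j) c))^a."""
    return sum(
        (-1) ** j * comb(k, j) * (b / (b + (n - k + j) * c)) ** a
        for j in range(k + 1)
    )


def mixture_integral(n, k, c, a, b, upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the Gamma(a, b) mixture of
    (1 - e^{-cz})^k e^{-cz(n-k)}; the integrand is negligible beyond
    `upper` for the moderate parameter values used here."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        density = b**a / gamma(a) * z ** (a - 1) * exp(-b * z)
        total += density * (1 - exp(-c * z)) ** k * exp(-c * z * (n - k)) * h
    return total
```

For moderate parameter values (for instance `n=5, k=2, c=0.7, a=2.0, b=1.5`) the two numbers agree to several decimal places, which is what the identity predicts.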

On page 18, we showed how to calculate conditional distributions for 
future observations given the ones that have already been observed us- 
ing parametric families. For exchangeable random variables, we can often 
find these posterior predictive distributions without even introducing the 
parameter. 






Example 1.54 (Continuation of Example 1.52; see page 29). In the case of
Bernoulli random variables, it is not difficult to calculate conditional
probabilities. Suppose that we observe $k^*$ successes in the first $n^*$ trials, and we are
interested in the probability of $k$ successes in the next $n$ trials. It is
straightforward to calculate the probability of $k$ successes in the next $n$ trials given $k^*$
successes in the first $n^*$ trials as

$$\frac{\binom{n}{k}\binom{n^*}{k^*}}{\binom{n^*+n}{k^*+k}} \cdot \frac{n^*+1}{n^*+n+1}.$$

For example, we get that the probability of $k$ successes in the next $n$ trials given
two successes in the first five trials is

$$\frac{60\binom{n}{k}}{\binom{n+5}{k+2}(n+6)}. \qquad (1.55)$$

It is easy to see that the future trials are still exchangeable given the past, and one
could use the distribution in (1.55) to find the distribution of $\Theta$ given the observed
trials, just as we found the original distribution of $\Theta$ to be $U(0, 1)$. Alternatively,
we have a theorem that applies to all exchangeable Bernoulli sequences.
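As a sketch of the calculation in Example 1.54 (the function name `predictive` is illustrative), the conditional probability is the ratio of two probabilities that follow from $\Pr(k \text{ successes in } n \text{ trials}) = 1/(n+1)$: the joint probability of $k^*$ successes in the first $n^*$ trials together with $k$ successes in the next $n$, and the marginal probability $1/(n^*+1)$.

```python
from fractions import Fraction
from math import comb


def predictive(k, n, k_star, n_star):
    """Pr(k successes in the next n trials | k_star successes in the
    first n_star trials) under Bayes's assignment 1/(n+1)."""
    # Each particular sequence of length n_star + n with k_star + k ones
    # has probability 1 / ((n_star + n + 1) * C(n_star + n, k_star + k)).
    joint = Fraction(
        comb(n, k) * comb(n_star, k_star),
        (n_star + n + 1) * comb(n_star + n, k_star + k),
    )
    marginal = Fraction(1, n_star + 1)
    return joint / marginal
```

For two successes in five trials, `predictive(1, 1, 2, 5)` gives the probability of a success on the sixth trial, and for fixed `n` the values over `k = 0, ..., n` sum to one.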

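Theorem 1.56. Suppose that $\{X_i\}_{i=1}^\infty$ is an infinite sequence of
exchangeable Bernoulli random variables. Let $\Theta = \lim_{n\to\infty} \sum_{i=1}^n X_i/n$, and let $\mu_\Theta$
be the distribution of $\Theta$. Conditional on seeing $k^*$ successes in $n^*$ trials,
the distribution of $\Theta$ has CDF

$$F^*(t) = \frac{\int_{[0,t]} \theta^{k^*}(1-\theta)^{n^*-k^*}\, d\mu_\Theta(\theta)}{\int_{[0,1]} \psi^{k^*}(1-\psi)^{n^*-k^*}\, d\mu_\Theta(\psi)}.$$

Proof. We already know that the $X_i$ are conditionally IID with $Ber(\theta)$
distribution given $\Theta = \theta$. It follows from the definition of conditional
probability that, for every Borel subset $B$ of $[0, 1]$ and every $n^*$ and $0 \le k^* \le n^*$,

$$\Pr(k^* \text{ successes in } n^* \text{ trials and } \Theta \in B) = \int_B \binom{n^*}{k^*} \theta^{k^*}(1-\theta)^{n^*-k^*}\, d\mu_\Theta(\theta).$$

Dividing this by

$$\Pr(k^* \text{ successes in } n^* \text{ trials}) = \int_{[0,1]} \binom{n^*}{k^*} \psi^{k^*}(1-\psi)^{n^*-k^*}\, d\mu_\Theta(\psi)$$

completes the proof. □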

Example 1.57 (Continuation of Example 1.54). After observing $k^*$ successes in
$n^*$ trials, we can find the conditional distribution of $\Theta$ from Theorem 1.56. For
example, if $n^* = 5$ and $k^* = 2$, $\Theta$ has conditional density

$$f^*(\theta) = 60\,\theta^2(1-\theta)^3$$

with respect to Lebesgue measure. The probability of a success on the sixth trial
given that we observed two successes in the first five trials is then

$$\int_0^1 60\,\theta^3(1-\theta)^3\, d\theta = \frac{3}{7},$$

which agrees with (1.55) with $n = 1$ and $k = 1$. After observing $k^*$ successes in
$n^*$ trials, the distribution of $\Theta$ has density

$$\frac{(n^*+1)!}{k^*!\,(n^*-k^*)!}\, \theta^{k^*}(1-\theta)^{n^*-k^*},$$

which is the density of a $Beta(k^*+1, n^*-k^*+1)$ random variable. Given the
data, the probability of success on the next trial is the conditional mean of $\Theta$,
namely $(k^*+1)/(n^*+2)$, which is approximately $k^*/n^*$ if $n^*$ is large.
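The constants in Example 1.57 follow from the elementary identity $\int_0^1 \theta^p(1-\theta)^q\, d\theta = p!\,q!/(p+q+1)!$ for nonnegative integers $p$ and $q$. The sketch below (function names are illustrative) uses it to recover the normalizing constant of the conditional density and the probability of success on the next trial when $\Theta$ starts with the $U(0, 1)$ distribution.

```python
from fractions import Fraction
from math import factorial


def beta_integral(p, q):
    """Exact value of the integral of theta^p (1 - theta)^q over [0, 1]
    for nonnegative integers p and q."""
    return Fraction(factorial(p) * factorial(q), factorial(p + q + 1))


def posterior_constant(k_star, n_star):
    """Normalizing constant of theta^k* (1 - theta)^(n* - k*)."""
    return 1 / beta_integral(k_star, n_star - k_star)


def prob_next_success(k_star, n_star):
    """Conditional mean of Theta given k* successes in n* trials."""
    numerator = beta_integral(k_star + 1, n_star - k_star)
    return numerator / beta_integral(k_star, n_star - k_star)
```

With `k_star=2` and `n_star=5` the constant is 60 and the predictive probability is 3/7; in general `prob_next_success` reduces to $(k^*+1)/(n^*+2)$.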

This example helps to illustrate why observed frequencies are relevant for
calculating probabilities, if the observations are thought to be exchangeable. Suppose
that the distribution $\mu_\Theta$ has a density $f$ with respect to some measure on $[0, 1]$.
Then the conditional density of $\Theta$ given $k^*$ successes observed in $n^*$ trials will
be a constant times $\theta^{k^*}(1-\theta)^{n^*-k^*} f(\theta)$. This density is higher for $\theta$ values near
the observed $k^*/n^*$ than is $f$. As $n^*$ gets larger, this becomes more pronounced
to the point of the density resembling a huge spike near $k^*/n^*$, if $f$ is strictly
positive in the vicinity of this value. This argument is the heuristic justification
for the common practice of estimating $\Theta$ by $k^*/n^*$. The justification only applies
when we believe the trials to be exchangeable. We do not have to believe that
there exists a "fixed value" $\theta$ such that the trials are IID $Ber(\theta)$. We are just
trying to estimate (or predict) what the limit of the relative frequencies will be.

Example 1.58 (Continuation of Example 1.53; see page 29). In the case of
Bernoulli random variables, we saw that observing more data helped us to learn
the value of $\Theta$ more precisely. If observing more data is supposed to help us to learn
$\mathbf{P}$ in the present example, then the conditional distribution of $X_{n+1}$ given the
first $n$ observations should approach $\mathbf{P}$. The conditional density of $X_{n+1}$ given
the first $n$ observations is

$$f_{X_{n+1}|X_1,\ldots,X_n}(x|x_1, \ldots, x_n) = \frac{\Gamma(a+n+1)\left(b + \sum_{i=1}^n x_i\right)^{a+n}}{\Gamma(a+n)\left(b + x + \sum_{i=1}^n x_i\right)^{a+n+1}} = \frac{a^* (b^*)^{a^*}}{(b^* + x)^{a^*+1}},$$

where $a^* = a + n$ and $b^* = b + \sum_{i=1}^n x_i$. Suppose that $a^*/b^*$ converges to $\theta$ as
$n \to \infty$. Then, the conditional density above converges to $\theta\exp(-x\theta)$, an $Exp(\theta)$
density. Once again, it is as if $\mathbf{P}$ is the CDF of the $Exp(\Theta)$ distribution for some
random variable $\Theta$. This is like saying that the distribution of $\mathbf{P}$ is concentrated
on the set

$$E = \{P : P((-\infty, x]) = 1 - \exp(-x\theta), \text{ for some } \theta > 0\}.$$

In Problem 22 on page 77, you can prove that the probability is distributed over
$E$ as follows. Consider the mapping $V : \mathbb{R}^+ \to E$ defined by $V(\theta) = F_\theta$, where
$F_\theta(x) = 1 - \exp(-x\theta)$ for all $x > 0$. To determine the appropriate probability
measure on $\mathbb{R}^+$, we first note that the conditional distribution of the future
given the past depends on the past only through $\sum_{i=1}^n X_i$. We suspect that some
function of this converges in distribution to $\Theta$. Solve Problem 21 on page 76 to
determine the appropriate function and the limiting distribution. The measure
induced on $E$ by $V$ is the distribution of $\mathbf{P}$, $\mu_{\mathbf{P}}$. Integration in the space $\mathcal{P}$ is
performed by integrating over $\mathbb{R}^+$:

$$\int_B g(P)\, d\mu_{\mathbf{P}}(P) = \int_{V^{-1}(B)} g(V(\theta))\, d\mu_\Theta(\theta).$$

The function $V^{-1}$ gives us a way of dealing with $\mathbf{P}$ as if it were the real
number $\Theta = V^{-1}(\mathbf{P})$. This is a special case of a parametric index, to be defined in
Definition 1.85.
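The convergence claimed in Example 1.58 can be illustrated numerically. This is a sketch with hypothetical names. Since $\Gamma(a+n+1)/\Gamma(a+n) = a+n$, the conditional density of $X_{n+1}$ equals $a^*(b^*)^{a^*}/(b^*+x)^{a^*+1}$ with $a^* = a + n$ and $b^* = b + \sum x_i$; the code evaluates it on the log scale and compares it with the $Exp(\theta)$ density when $a^*/b^*$ is close to $\theta$.

```python
from math import exp, log


def predictive_density(x, a, b, xs):
    """Conditional density of X_{n+1} at x given observations xs, for the
    joint density of Example 1.53 with parameters a and b."""
    a_star = a + len(xs)
    b_star = b + sum(xs)
    # a* (b*)^{a*} / (b* + x)^{a* + 1}, computed with logarithms so that
    # large n does not overflow.
    log_density = log(a_star) + a_star * log(b_star) - (a_star + 1) * log(b_star + x)
    return exp(log_density)


def exp_density(x, theta):
    """Density of the Exp(theta) distribution at x > 0."""
    return theta * exp(-theta * x)
```

As an artificial check, taking every observation equal to 0.5 makes $a^*/b^*$ approach 2 as the sample grows, and `predictive_density(x, a, b, xs)` then approaches `exp_density(x, 2.0)`.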

1.5 Proofs of DeFinetti's Theorem and Related 
Results* 

1.5.1 Strong Law of Large Numbers 

Because the infinite forms of DeFinetti's theorem state that certain pro- 
portions converge almost surely, we will need to prove a strong law of large 
numbers for exchangeable random variables. 19 The strong law of large num- 
bers for IID random variables 1.63 says that, for a sequence of IID random 
variables with finite mean, the sequence of averages of the first n of them 
converges almost surely to the mean. As stated, this result is clearly false 
for exchangeable random variables. For example, let X have finite mean 
but nondegenerate distribution, and suppose that $X_i = X$ for all $i$. Then
$\{X_n\}_{n=1}^\infty$ is clearly an exchangeable sequence, and the average of the first
n is equal to X for every n. Hence, the averages converge almost surely to 
X, not the mean. For exchangeable random variables, we can only prove 
that the averages converge to some random variable which might not be 
constant. 

We will prove two versions of the strong law for two different sets of 
readers. The first version, Theorem 1.59, is based solely on elementary 
probability theory, but it concludes only that a subsequence of the sequence 
of sample averages converges almost surely, 20 and then only under the 
assumption of finite variance. But the restricted result of Theorem 1.59 
is all that is needed in the proof of Theorem 1.49. The second version, 
Theorem 1.62, is based on the theory of reversed martingales and is a more 
complete statement of the strong law of large numbers than Theorem 1.59. 



*This section may be skipped without interrupting the flow of ideas.

19 One consequence of DeFinetti's representation theorem 1.49 will be that
many theorems that apply to IID random quantities can be adapted to apply
to exchangeable random quantities. Taylor, Daffer, and Patterson (1985) prove
some examples of such theorems. See also Problem 31 on page 78 and Problem 39
on page 79. One such result would be the strong law of large numbers 1.62.
Unfortunately, one of the steps in the proofs of Theorems 1.47 and 1.49 makes
use of a strong law of large numbers for exchangeable random variables.

20 Hartigan (1983, Theorem 4.6) simplifies his proof of DeFinetti's theorem
using this fact. The statement of Theorem 1.59 can be extended to show that the
entire sequence of sample averages converges almost surely as in Problem 38 on
page 78.






1.5.1.1 An Elementary Version 

We state and prove here an elementary version of the strong law of large 
numbers for exchangeable random variables, which is only general enough 
to allow us to prove Theorem 1.49. Those who desire a more complete 
statement of the strong law of large numbers and who have some famil- 
iarity with martingales may safely skip this theorem and proceed to the 
martingale version in Section 1.5.1.2. 21

Theorem 1.59. 22 Let $(S, \mathcal{A}, \mu)$ be a probability space and, for each $n$,
let $X_n : S \to \mathbb{R}$ be exchangeable random variables. Assume that $-\infty <
E(X_i X_j) = m_2 < \infty$ for all $i \ne j$ and $E(X_i^2) = \mu_2 < \infty$ for all $i$. Let
$Y_n = \sum_{i=1}^n X_i/n$. Then the subsequence $\{Y_{8^k}\}_{k=1}^\infty$ converges almost surely.

Proof. We will prove that the subsequence converges almost surely by
proving that it is a Cauchy sequence 23 almost surely. Use Tchebychev's
inequality B.16 to write, for $m > n$,

$$\Pr(|Y_m - Y_n| \ge c) \le \frac{E(Y_m - Y_n)^2}{c^2}$$
$$= \left(E(Y_n^2) + E(Y_m^2) - 2E(Y_n Y_m)\right)\frac{1}{c^2}$$
$$= \left\{\frac{1}{n^2}[n\mu_2 + n(n-1)m_2] + \frac{1}{m^2}[m\mu_2 + m(m-1)m_2] - \frac{2}{mn}[n\mu_2 + n(n-1)m_2 + n(m-n)m_2]\right\}\frac{1}{c^2}$$
$$= \left(\frac{1}{n} - \frac{1}{m}\right)\frac{\mu_2 - m_2}{c^2} \le \frac{\mu_2 - m_2}{nc^2}. \qquad (1.60)$$

Now, let $Z_k = Y_{8^k}$ and $A_k = \{s : |Z_{k+1}(s) - Z_k(s)| \ge 2^{-k}\}$. It follows easily from
(1.60) (with $c = 2^{-k}$) that, for every $k$, $\Pr(A_k) \le (\mu_2 - m_2)2^{-k}$. Now, let
$A = \cap_{n=1}^\infty \cup_{k=n}^\infty A_k$. It follows from the first Borel-Cantelli lemma A.20 that
$\Pr(A) = 0$. We finish the proof by showing that, for every $s \in A^C$, and every
$\epsilon > 0$, there exists $N_{s,\epsilon}$ such that $n, m \ge N_{s,\epsilon}$ implies $|Z_n(s) - Z_m(s)| < \epsilon$.

Write $A^C = \cup_{n=1}^\infty \cap_{k=n}^\infty A_k^C$. For every $s \in A^C$, there exists $c_s$ such that
$s \in \cap_{k=c_s}^\infty A_k^C$. If $m > n \ge c_s$, it follows that

$$|Z_m(s) - Z_n(s)| \le \sum_{i=n}^{m-1} |Z_{i+1}(s) - Z_i(s)| < 2^{-n+1} \le 2^{-c_s+1}.$$

So, let $N_{s,\epsilon} \ge 1 + \max\{c_s, -\log_2 \epsilon\}$ to finish the proof. □

21 Two theorems (7.49 and 7.80) make use of the strong law of large numbers
for IID random variables 1.63. This result is available as a consequence of
Theorem 1.62, but not from Theorem 1.59.

22 This theorem is used in the proof of Theorem 1.49. Its proof resembles the
proof of a similar claim in Loeve (1977, Section 6.3).

23 A sequence of real numbers $\{x_n\}_{n=1}^\infty$ is Cauchy if, for every $\epsilon > 0$, there exists
$N$ such that $n, m \ge N$ implies $|x_n - x_m| < \epsilon$. Since $\mathbb{R}$ is a complete metric space,
every Cauchy sequence converges.

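The algebra behind (1.60) is easy to confirm with exact arithmetic. The following sketch uses illustrative function names. It computes $E(Y_m - Y_n)^2$ from the three second moments used in the proof, compares it with the closed form $(1/n - 1/m)(\mu_2 - m_2)$, and evaluates the resulting Tchebychev bound on $\Pr(A_k)$ along the subsequence $8^k$ with $c = 2^{-k}$.

```python
from fractions import Fraction


def second_moment_of_difference(n, m, mu2, m2):
    """E(Y_m - Y_n)^2 for n <= m, where E(X_i^2) = mu2 and
    E(X_i X_j) = m2 for i != j."""
    e_yn2 = Fraction(n * mu2 + n * (n - 1) * m2, n * n)
    e_ym2 = Fraction(m * mu2 + m * (m - 1) * m2, m * m)
    # Y_n Y_m has n squared terms and n(n-1) + n(m-n) cross terms.
    e_ynym = Fraction(n * mu2 + n * (n - 1) * m2 + n * (m - n) * m2, m * n)
    return e_yn2 + e_ym2 - 2 * e_ynym


def closed_form(n, m, mu2, m2):
    """(1/n - 1/m)(mu2 - m2), the simplified form in (1.60)."""
    return (Fraction(1, n) - Fraction(1, m)) * (mu2 - m2)


def bound_on_Ak(k, mu2, m2):
    """Tchebychev bound on Pr(|Y_{8^(k+1)} - Y_{8^k}| >= 2^-k)."""
    c = Fraction(1, 2**k)
    return closed_form(8**k, 8 ** (k + 1), mu2, m2) / c**2
```

The bound returned by `bound_on_Ak` is at most $(\mu_2 - m_2)2^{-k}$, so it is summable over $k$, which is what the Borel-Cantelli step requires.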
There are strong laws of large numbers for IID random variables also. 
Here is one which will help in the proof of Theorem 1.49. 

Lemma 1.61 (Strong law of large numbers: bounded conditionally
IID case). Let $\{X_n\}_{n=1}^\infty$ be a sequence of bounded random variables,
and let $\Theta$ be an arbitrary random quantity such that, conditional on $\Theta$,
the $X_n$ are IID with mean $c(\Theta)$. Then $\sum_{i=1}^n X_i/n = Y_n$ converges almost
surely to $c(\Theta)$.

Proof. Since $Y_n$ converges almost surely to $c(\Theta)$ if and only if $Y_n - c(\Theta)$
converges almost surely to 0, and since $Y_n - c(\Theta)$ is the sample average of
the first $n$ of $X_i - c(\Theta)$, assume that $c(\Theta) = 0$ without loss of generality.
Now, write

$$Y_n^4 = \frac{1}{n^4}\left(\sum_{i_1=1}^n \sum_{i_2=1}^n \sum_{i_3=1}^n \sum_{i_4=1}^n X_{i_1}X_{i_2}X_{i_3}X_{i_4}\right).$$

Each of the terms above, for which at least one of $i_1, i_2, i_3, i_4$ is not repeated,
has mean 0 because the random variables are conditionally independent
with mean 0. Let $M$ be a bound for $|X_n|$. It follows that

$$E(Y_n^4|\Theta) = \frac{1}{n^3}E(X_1^4|\Theta) + \frac{3(n-1)}{n^3}\left(E(X_1^2|\Theta)\right)^2 \le \frac{4M^4}{n^2}.$$

So, $E(Y_n^4) \le 4M^4/n^2$ by the law of total probability B.70. It follows from
the Markov inequality B.15 that, for each $\epsilon > 0$,

$$\Pr(|Y_n| > \epsilon) = \Pr(Y_n^4 > \epsilon^4) \le \frac{E(Y_n^4)}{\epsilon^4} \le \frac{4M^4}{\epsilon^4 n^2}.$$

So, $\sum_{n=1}^\infty \Pr(|Y_n| > \epsilon) < \infty$. The first Borel-Cantelli lemma A.20 implies
that $\Pr(|Y_n| > \epsilon \text{ infinitely often}) = 0$. Since the event that $Y_n$ converges to
0 is $\cap_{k=1}^\infty \{|Y_n| > 1/k \text{ infinitely often}\}^C$, it follows that $Y_n$ converges to 0
almost surely. □
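The fourth-moment formula in the proof of Lemma 1.61 can be checked by brute force for a small discrete distribution. The sketch below is illustrative only: the function names and the two-point distribution used in the check are assumptions, not part of the text. It enumerates all outcomes of $n$ IID mean-zero variables and compares $E(Y_n^4)$ with $E(X_1^4)/n^3 + 3(n-1)(E(X_1^2))^2/n^3$.

```python
from fractions import Fraction
from itertools import product


def fourth_moment_exact(n, values, probs):
    """E(Y_n^4) for the average Y_n of n IID variables with the given
    finite distribution, by enumerating all len(values)**n outcomes."""
    total = Fraction(0)
    for outcome in product(range(len(values)), repeat=n):
        p = Fraction(1)
        s = Fraction(0)
        for i in outcome:
            p *= probs[i]
            s += values[i]
        total += p * (s / n) ** 4
    return total


def fourth_moment_formula(n, values, probs):
    """E(X^4)/n^3 + 3(n-1)(E X^2)^2/n^3; valid when E(X) = 0."""
    ex2 = sum(p * v**2 for v, p in zip(values, probs))
    ex4 = sum(p * v**4 for v, p in zip(values, probs))
    return ex4 / n**3 + 3 * (n - 1) * ex2**2 / n**3
```

For a variable taking the value -1 with probability 2/3 and 2 with probability 1/3 (mean zero, bounded by $M = 2$), the two functions agree for every `n` tried, and the common value stays below $4M^4/n^2$.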






1.5.1.2 A Martingale Version+

A more complete proof of the strong law of large numbers for exchangeable 
random variables is borrowed from Kingman (1978). 24 

Theorem 1.62 (Strong law of large numbers). 25 Let $(S, \mathcal{A}, \mu)$ be a
probability space, and let $X_i : S \to \mathbb{R}$ be measurable for all $i$ such that the
$X_i$ are exchangeable with $E(|X_i|) < \infty$ for all $i$. Then there exists a $\sigma$-field
$\mathcal{A}_\infty$ such that $Y_n = \sum_{i=1}^n X_i/n$ converges almost surely to $E(X_1|\mathcal{A}_\infty)$.

Proof. Define $X = (X_1, X_2, \ldots)$. For $n > 0$, let $\mathcal{C}_n$ be the collection of all
Borel subsets $A$ of $\mathbb{R}^\infty$ which satisfy $x \in A$ if and only if $y \in A$ for all $y$
that agree with $x$ after coordinate $n$ and such that the first $n$ coordinates
of $y$ are a permutation of the first $n$ coordinates of $x$. It is not difficult to
show that $\mathcal{C}_n$ is a $\sigma$-field, and it is trivial to see that $f(x) = \sum_{i=1}^n x_i$ is
measurable with respect to $\mathcal{C}_n$ (Problem 36 on page 78). Let $\mathcal{A}_n = X^{-1}(\mathcal{C}_n)$
and $Z_n = E(X_1|\mathcal{A}_n)$. Since $E(|X_1|) < \infty$ and $\{\mathcal{A}_n\}_{n=1}^\infty$ is a decreasing
sequence of $\sigma$-fields, it follows from Part II of Levy's theorem B.124 that
$\lim_{n\to\infty} Z_n = E(X_1|\mathcal{A}_\infty)$ and is finite a.s. We now show that $Y_n = Z_n$.
Since $f(x) = \sum_{i=1}^n x_i$ is measurable with respect to $\mathcal{C}_n$, we need only prove
that, for $A \in \mathcal{A}_n$, $E(I_A Y_n) = E(I_A X_1)$. But, $I_A X_i$ has the same distribution
as $I_A X_j$ for all $i, j = 1, \ldots, n$ by the assumption of exchangeability and the
permutation symmetry of the set $A$. Hence $E(I_A X_1) = \sum_{i=1}^n E(I_A X_i)/n =
E(I_A Y_n)$. □
As a corollary, we also mention the usual strong law of large numbers for 
IID random variables. 

Corollary 1.63 (Strong law of large numbers: IID case). 26 Suppose
that $\{X_n\}_{n=1}^\infty$ is a sequence of IID random variables with $E(X_i) = \mu$. Then
$\sum_{i=1}^n X_i/n = Y_n$ converges almost surely to $\mu$.



1.5.2 The Bernoulli Case 

The proof of DeFinetti's representation theorem for finitely many Bernoulli
random variables $X_1, \ldots, X_N$ was given in Example 1.46. There, we saw
that the conditional distribution of the number $K$ of successes in $n$ trials
given $M = m$ was hypergeometric $Hyp(N, n, m)$, where $M = \sum_{i=1}^N X_i$. So,
for example,

$$\Pr(K = k|M = m) = \frac{\binom{n}{k}\binom{N-n}{m-k}}{\binom{N}{m}}. \qquad (1.64)$$

Suppose that $N \to \infty$ in such a way that $M/N \to \Theta$. For fixed $n$ and $k$,
we can take limits in (1.64) as $N \to \infty$ and $m/N \to \theta$. Formally, we would
get

$$\Pr(K = k|\Theta = \theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k},$$

which is the model for $K \sim Bin(n, \theta)$. In fact, this is what Theorem 1.47
says is the case. The precise proof is a bit more complicated than the
heuristic argument above, but the idea is the same.

+ This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.

24 This proof is also similar to one given for the case of IID random variables by
Doob (1953, Section VII, 6). Those who are unfamiliar with martingale theory
may safely skip this section and study the elementary version given earlier. But
these readers should be aware that two theorems (7.49 and 7.80) do make use of
Corollary 1.63.

25 This theorem can be used in the proof of Theorem 1.49.

26 This corollary is used in the proofs of Theorems 7.49 and 7.80.

Proof of Theorem 1.47. The "if" direction is simple and is left to the
reader. For the "only if" direction, assume that $\{X_n\}_{n=1}^\infty$ is an exchangeable
Bernoulli sequence. Let $Y_\ell = \sum_{i=1}^\ell X_i/\ell$ for $\ell = 1, 2, \ldots$. By the strong law
of large numbers 1.62 or 1.59, we know that $Y_{8^n}$ converges almost surely.
Let $\Theta$ denote the limit when the limit exists, and let $\Theta = 1/2$ when the
limit does not exist. Let $\mu_\Theta$ denote the distribution of $\Theta$ (a probability
measure on $[0, 1]$).

The main step in the proof is to show that for every integer $k$ and every
$j_1, \ldots, j_k \in \{0, 1\}$ and every Borel subset $C$ of $[0, 1]$,

$$\Pr(X_1 = j_1, \ldots, X_k = j_k, \Theta \in C) = \int_C \theta^y(1-\theta)^{k-y}\, d\mu_\Theta(\theta), \qquad (1.65)$$

where $y = j_1 + \cdots + j_k$. To show this, let $Z_n = I_C(\Theta)\, Y_{8^n}^y (1 - Y_{8^n})^{k-y}$
and let $Z = I_C(\Theta)\, \Theta^y(1-\Theta)^{k-y}$. It is easy to see that $Z_n \to Z$ a.s., hence
$Z_n \overset{\mathcal{D}}{\to} Z$ by Theorem B.90. Since $Z_n$ is uniformly bounded, $E(Z_n) \to E(Z)$.
The right-hand side of (1.65) is just $E(Z)$. So, we need only show that
$E(Z_n)$ converges to the left-hand side of (1.65). Let $m = 8^n$, and define
$W_{\ell,t} = I_{\{j_t\}}(X_\ell)$, for each integer $t = 1, \ldots, k$. Then

$$\frac{1}{m}\sum_{\ell=1}^m W_{\ell,t} = \begin{cases} Y_m & \text{if } j_t = 1, \\ 1 - Y_m & \text{if } j_t = 0. \end{cases}$$

With this notation, we can write

$$Z_n = I_C(\Theta)\frac{1}{m^k}\sum_{i_1=1}^m \cdots \sum_{i_k=1}^m \prod_{t=1}^k W_{i_t,t}$$
$$= I_C(\Theta)\frac{1}{m^k}\sum_{\text{all } i_t \text{ distinct}} \prod_{t=1}^k W_{i_t,t} + I_C(\Theta)\frac{1}{m^k}\sum_{\text{at least two } i_t \text{ equal}} \prod_{t=1}^k W_{i_t,t}. \qquad (1.66)$$

The first sum on the right-hand side of (1.66) has $m!/(m-k)!$ terms.
Since $E[I_C(\Theta)\prod_{t=1}^k W_{i_t,t}]$ equals the left-hand side of (1.65) when all $i_t$
are distinct, and since $m!/[(m-k)!\,m^k]$ converges to 1, the mean of the
first term on the right of (1.66) converges to the left-hand side of (1.65).
The second sum has $m^k - m!/(m-k)!$ terms, each of which is bounded
between 0 and 1. Since $1 - m!/[(m-k)!\,m^k]$ converges to 0, so does the
mean of the last expression in (1.66). This completes the proof of (1.65).

Equation (1.65) is exactly what it means to say that $X_1, \ldots, X_k$ are
conditionally IID $Ber(\theta)$ given $\Theta = \theta$. To see that the distribution $\mu_\Theta$ is
unique, let $C = [0, 1]$ in (1.65) and note that this equation determines the
means of all polynomial functions of $\Theta$. Since polynomials are dense in the
set of all bounded continuous functions on $[0, 1]$ by the Stone-Weierstrass
theorem C.3, it follows that (1.65) determines the means of all bounded
continuous functions of $\Theta$, and Corollary B.107 says that the means of
all bounded continuous functions determine the distribution. To finish the
proof, we note that since $\{X_n\}_{n=1}^\infty$ are bounded and conditionally IID,
Lemma 1.61 says that $\{Y_\ell\}_{\ell=1}^\infty$ converges a.s. Obviously, the limit must be
$\Theta$, a.s. □
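The counting step in the proof of Theorem 1.47 relies on $m!/[(m-k)!\,m^k] \to 1$ as $m = 8^n$ grows with $k$ fixed. The sketch below (names are illustrative) computes this ratio exactly, cross-checks it against a brute-force count of index tuples with distinct entries, and compares the shortfall from 1 with the elementary bound $k(k-1)/(2m)$.

```python
from fractions import Fraction
from itertools import product


def distinct_fraction(m, k):
    """m!/((m-k)! m^k): the share of the m^k index tuples (i_1, ..., i_k)
    whose entries are all distinct."""
    frac = Fraction(1)
    for i in range(k):
        frac *= Fraction(m - i, m)
    return frac


def count_distinct(m, k):
    """Brute-force count of k-tuples from {0, ..., m-1} with distinct
    entries (only practical for small m and k)."""
    return sum(1 for t in product(range(m), repeat=k) if len(set(t)) == k)
```

Since $1 - \prod_{i=0}^{k-1}(1 - i/m) \le \sum_{i=0}^{k-1} i/m = k(k-1)/(2m)$, the share of tuples with repeated indices vanishes at rate $1/m$ along $m = 8^n$.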



1.5.3 The General Finite Case* 

1.5.3.1 Proof of Theorem 1.48 

Define a function $h : \mathcal{X}^N \to \mathcal{P}$ by $h(x) = P_x$, where

$$P_x(B) = \frac{1}{N}\sum_{i=1}^N I_B(x_i)$$

if $x = (x_1, \ldots, x_N)$ and $B \in \mathcal{B}$. We refer to $P_x$ as the empirical distribution
of $x$. It is easy to check that the function $h$ is measurable (see Problem 24
on page 77). Simple random samples with/without replacement from $P_x$
can be defined in exactly the same way as they were for samples from $\mathbf{P}_N$
in Section 1.4.2.1. In fact, if $X = (X_1, \ldots, X_N)$, then $P_X = \mathbf{P}_N$.

Even though the space $(\mathcal{P}, \mathcal{C}_{\mathcal{P}})$ is quite complicated, the subset of $\mathcal{P}$ in
which $\mathbf{P}_N$ lies is relatively simple. $\mathbf{P}_N$ concentrates all of its mass on at
most $N$ different points in $\mathcal{X}$. Hence, it is not nearly as complicated an
object as it may appear. In fact, it is really nothing more than two vectors
of equal length (at most $N$), where the coordinates of one of the vectors are
elements of $\mathcal{X}$ and the coordinates of the other are nonnegative multiples
of $1/N$ adding to 1.

The following lemma will be used to help prove Theorem 1.48 and an 
approximation in Theorem 1.70. 

Lemma 1.67. Suppose that $X = (X_1, \ldots, X_N)$ are exchangeable random
quantities. Let $Q_X$ be the probability measure giving their joint distribution
on $(\mathcal{X}^N, \mathcal{B}^N)$. For $n \le N$, $x = (x_1, \ldots, x_N)$, and $B \in \mathcal{B}^n$, let $H_n(B|x)$
stand for the probability that $n$ draws without replacement from an urn
containing balls labeled $x_1, \ldots, x_N$ form a point in $B$. Also, let $M_n(B|x)$ stand
for the probability that $n$ draws with replacement from an urn containing
balls labeled $x_1, \ldots, x_N$ form a point in $B$. Let $Y_1, \ldots, Y_N$ be conditionally
IID given $X = x$ with distribution $M_1(\cdot|x)$. Then

$$\Pr((X_{i_1}, \ldots, X_{i_n}) \in B, X \in C) = \int_C H_n(B|x)\, dQ_X(x),$$
$$\Pr((Y_{i_1}, \ldots, Y_{i_n}) \in B, X \in C) = \int_C M_n(B|x)\, dQ_X(x),$$

for each $B \in \mathcal{B}^n$ and each $C \in \mathcal{B}^N$.

Proof. The second equation above is immediate from the definition of
conditional distribution. Next, notice that $H_n(B|x)$ can be written

$$H_n(B|x) = \frac{1}{n!\binom{N}{n}} \sum_{\text{all distinct } (j_1, \ldots, j_n)} I_B(x_{j_1}, \ldots, x_{j_n}).$$

By the exchangeability of $X_1, \ldots, X_N$, we have

$$\Pr((X_{i_1}, \ldots, X_{i_n}) \in B, X \in C) = \frac{1}{n!\binom{N}{n}} \sum_{\text{all distinct } (j_1, \ldots, j_n)} \Pr((X_{j_1}, \ldots, X_{j_n}) \in B, X \in C)$$
$$= \frac{1}{n!\binom{N}{n}} \sum_{\text{all distinct } (j_1, \ldots, j_n)} \int_C I_B(x_{j_1}, \ldots, x_{j_n})\, dQ_X(x)$$
$$= \int_C H_n(B|x)\, dQ_X(x). \qquad □$$

*This section may be skipped without interrupting the flow of ideas.



We are now in position to prove Theorem 1.48.
Proof of Theorem 1.48. The "if" part is fairly straightforward and left
to the reader. Only the $n = N$ case is needed, since the others follow from
it by taking marginal distributions.

For the "only if" part, assume that $X_1, \ldots, X_N$ are exchangeable, and
let $\mathbf{P}_N$ be as defined. For each probability $P$ with support on at most $N$
points and probabilities of the form $k/N$, let $H'_n(B|P)$ be the probability
that $n$ draws without replacement from $P$ form a point in $B \in \mathcal{B}^n$. What
we need to prove is that for all $n$ and all distinct $i_1, \ldots, i_n$ and all $B \in \mathcal{B}^n$
and $C \in \mathcal{C}_{\mathcal{P}}$,

$$\Pr((X_{i_1}, \ldots, X_{i_n}) \in B, \mathbf{P}_N \in C) = \int_C H'_n(B|P)\, d\mu_N(P), \qquad (1.68)$$

where $\mu_N$ is the probability measure giving the distribution of $\mathbf{P}_N$. Let
$h : \mathcal{X}^N \to \mathcal{P}$ be the function that maps a point $x = (x_1, \ldots, x_N)$ to the
empirical distribution of $x$. That is, $h(x) = P_x$ and $h(X) = \mathbf{P}_N$. Let $Q_X$ be
the distribution of $X$. It follows that $\mu_N$ is the measure induced on $(\mathcal{P}, \mathcal{C}_{\mathcal{P}})$
from $Q_X$ by $h$. Also, it follows that $H'_n(B|P) = H_n(B|x)$ for all $x$ such
that $h(x) = P$. This means that $H_n(B|x)$ is a function of $h(x)$, namely
$H'_n(B|P_x)$. Now, write the left-hand side of (1.68) as

$$\Pr((X_{i_1}, \ldots, X_{i_n}) \in B, X \in h^{-1}(C)) = \int_{h^{-1}(C)} H_n(B|x)\, dQ_X(x) = \int_C H'_n(B|P)\, d\mu_N(P),$$

where the first equality follows from Lemma 1.67, and the second follows
from Theorem A.81. This proves (1.68). □

1.5.3.2 An Approximation Theorem

Some questions naturally arise when comparing the cases of finitely many
and infinitely many exchangeable random variables. First, what happens
when $N$ becomes infinite? Second, is there any sense in which the finite
or infinite $N$ cases approximate each other? The following lemma bounds
the difference between probabilities calculated under sampling with and
without replacement from a finite set and is useful in addressing these
approximation questions.

Lemma 1.69. Suppose that we have an urn with $N$ balls labeled $y_1, \ldots, y_N$.
Let $\mathcal{Y}$ be the set of distinct items in the set of labels. Let $X = (X_1, \ldots, X_n)$
be values of the labels for $n$ draws without replacement from the urn. Let
$Y = (Y_1, \ldots, Y_n)$ be values of the labels for $n$ draws with replacement from
the urn. Let $P$ be the distribution of $Y$ and let $Q$ be the distribution of $X$.
Then $\sup_{A \subseteq \mathcal{Y}^n} |P(A) - Q(A)| \le n(n-1)/(2N)$.

Proof. 27 First, suppose that there are no duplicate labels. Then both
$P$ and $Q$ are constant on the set $A = \{x : Q(\{x\}) > 0\}$ and on the
complement of this set. It follows that this set and its complement are
where the supremum of the difference occurs. The difference between $Q(A)$
and $P(A)$ is easily seen to be $1 - P(A) = 1 - n!\binom{N}{n}/N^n$. If $x_i \in [0, 1)$ for all
$i = 1, \ldots, n$, then it is easy to show (say, by induction) that $1 - \sum_{i=1}^n x_i \le
\prod_{i=1}^n (1 - x_i)$. It is clear that $n!\binom{N}{n}/N^n = \prod_{i=1}^n (1 - (i-1)/N)$, hence

$$\frac{n!\binom{N}{n}}{N^n} \ge 1 - \sum_{i=1}^n \frac{i-1}{N} = 1 - \frac{n(n-1)}{2N}.$$

The result now follows by subtracting both sides from 1.



27 Freedman (1977) gives the first part of this proof. 






For the general case, suppose that for each $i$, ball $i$ has two labels $(i, y_i)$,
and let $\mathcal{Y}_0$ denote the set of these labels. Let $X'$ and $Y'$ record both labels,
so that the first part of the proof applies to their distributions. Call these
distributions $Q'$ and $P'$. Assume that $X$ and $Y$ still only record the second
parts of the labels. For each $A \subseteq \mathcal{Y}^n$ there exists a set $A' \subseteq \mathcal{Y}_0^n$ such that
$X \in A$ if and only if $X' \in A'$, and $Y \in A$ if and only if $Y' \in A'$. In fact,

$$A' = \bigcup_{(x_1, \ldots, x_n) \in A}\ \prod_{j=1}^n \{(i, y_i) : y_i = x_j\}.$$

It now follows that

$$\sup_A |P(A) - Q(A)| \le \sup_{A'} |P'(A') - Q'(A')| \le \frac{n(n-1)}{2N}. \qquad □$$
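Lemma 1.69 can be checked by direct enumeration for a small urn. In the sketch below (function names are illustrative), the two sampling distributions on label tuples are built exactly, and $\sup_A |P(A) - Q(A)|$ is computed as the total mass by which $Q$ exceeds $P$, which is where the supremum is attained for two probability measures on a finite set.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product
from math import factorial


def draw_distribution(labels, n, replace):
    """Exact distribution of the label tuple for n ordered draws from an
    urn whose balls carry the given labels."""
    N = len(labels)
    if replace:
        index_draws = product(range(N), repeat=n)
        count = N**n
    else:
        index_draws = permutations(range(N), n)
        count = factorial(N) // factorial(N - n)
    dist = defaultdict(Fraction)
    for idx in index_draws:
        dist[tuple(labels[i] for i in idx)] += Fraction(1, count)
    return dist


def sup_difference(labels, n):
    """sup over sets A of |P(A) - Q(A)|, with P for draws with
    replacement and Q for draws without replacement."""
    P = draw_distribution(labels, n, replace=True)
    Q = draw_distribution(labels, n, replace=False)
    support = set(P) | set(Q)
    zero = Fraction(0)
    return sum(max(Q.get(s, zero) - P.get(s, zero), zero) for s in support)
```

With six distinct labels and `n = 3`, the result equals $1 - n!\binom{N}{n}/N^n$, as in the first part of the proof; with duplicate labels the value can only be smaller, and in both cases it is at most $n(n-1)/(2N)$.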



Lemma 1.69 allows us to prove some approximation theorems for fi- 
nite exchangeable sequences. The next theorem is borrowed from Diaconis 
and Freedman (1980a). It says that the joint distribution of finitely many 
exchangeable random quantities is uniformly approximated by the joint 
distribution of conditionally IID random quantities. 

Theorem 1.70. Suppose that $X_1, \ldots, X_N$ are exchangeable random
quantities taking values in a Borel space $(\mathcal{X}, \mathcal{B})$. For each $P \in \mathcal{P}$ and each $n$
and each $B \in \mathcal{B}^n$, let $P^n(B)$ stand for the probability that a vector of $n$
IID random variables with distribution $P$ lies in $B$. Let $\mu_N$ be the
distribution of $\mathbf{P}_N$. Then, for all $n$ and all distinct $i_1, \ldots, i_n$ in $\{1, \ldots, N\}$ and
all $B \in \mathcal{B}^n$,

$$\left|\Pr((X_{i_1}, \ldots, X_{i_n}) \in B) - \int_{\mathcal{P}} P^n(B)\, d\mu_N(P)\right| \le \frac{n(n-1)}{2N}. \qquad (1.71)$$

Proof. Let $Q_X$ stand for the joint distribution of $X_1, \ldots, X_N$. For $x =
(x_1, \ldots, x_N)$ and $B \in \mathcal{B}^n$, let $H_n(B|x)$ and $M_n(B|x)$ be as in Lemma 1.67.
(If $P_x$ is the empirical distribution of $x$, then $M_n(B|x) = P_x^n(B)$.) By
Lemma 1.69, we have $|H_n(B|x) - M_n(B|x)| \le n(n-1)/(2N)$ for all $x$. From
Lemma 1.67, we have

$$\left|\Pr((X_{i_1}, \ldots, X_{i_n}) \in B) - \int M_n(B|x)\, dQ_X(x)\right|$$
$$= \left|\int H_n(B|x)\, dQ_X(x) - \int M_n(B|x)\, dQ_X(x)\right| \le \frac{n(n-1)}{2N}.$$

All that remains is to show that the distribution $\mu_N$ of $\mathbf{P}_N$ satisfies

$$\int P^n(B)\, d\mu_N(P) = \int M_n(B|x)\, dQ_X(x). \qquad (1.72)$$

Consider, once again, the function $h : \mathcal{X}^N \to \mathcal{P}$ which maps a point $x \in \mathcal{X}^N$
to $P_x$. Note that $h(x) = M_1(\cdot|x)$ also. Since $h(X) = \mathbf{P}_N$, Theorem A.81
says that

$$\int P^n(B)\, d\mu_N(P) = \int P_x^n(B)\, dQ_X(x). \qquad (1.73)$$

Since $P_x^n(B) = M_n(B|x)$, it follows that (1.73) is (1.72). □
Theorem 1.49 says that infinitely many exchangeable random quantities
are conditionally IID. Theorem 1.70 says that there is continuity in passing
from the finite to the infinite case.
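For exchangeable Bernoulli variables, the bound (1.71) can be evaluated exactly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and `p[m]` denotes $\Pr(M = m)$ for $M = \sum_{i=1}^N X_i$. Because both laws on $\{0, 1\}^n$ are exchangeable, the supremum over sets $B$ of the difference equals half the sum of absolute differences between the two distributions of the number of ones.

```python
from fractions import Fraction
from math import comb


def exchangeable_pmf(N, n, p):
    """Law of K, the number of ones among n of N exchangeable Bernoulli
    variables: a p-mixture of hypergeometric distributions."""
    return [
        sum(
            p[m] * Fraction(comb(n, k) * comb(N - n, m - k), comb(N, m))
            for m in range(k, N - n + k + 1)
        )
        for k in range(n + 1)
    ]


def iid_mixture_pmf(N, n, p):
    """Law of K when the trials are conditionally IID Ber(m/N) given
    M = m: a p-mixture of binomial distributions."""
    return [
        sum(
            p[m] * comb(n, k) * Fraction(m, N) ** k * Fraction(N - m, N) ** (n - k)
            for m in range(N + 1)
        )
        for k in range(n + 1)
    ]


def sup_gap(N, n, p):
    """Largest difference between the two laws over all events."""
    a = exchangeable_pmf(N, n, p)
    b = iid_mixture_pmf(N, n, p)
    return sum(abs(u - v) for u, v in zip(a, b)) / 2
```

With `N = 20`, `n = 4`, and `p` uniform on `0, ..., 20`, the exchangeable law of `K` is uniform on `0, ..., 4`, and `sup_gap` stays below $n(n-1)/(2N) = 3/10$.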



1.5.3.3 Conditional Distributions* 

There is a general form for the conditional distributions of finitely many
exchangeable random quantities. If $X_1, \ldots, X_N$ are exchangeable, it is easy
to prove that $X_{k+1}, \ldots, X_N$ are exchangeable conditional on $X_1, \ldots, X_k$.

Proposition 1.74. If $X_1, \ldots, X_N$ are exchangeable, then $X_{k+1}, \ldots, X_N$
are exchangeable conditional on $X_1, \ldots, X_k$.

If one uses the conditional distribution of $X_{k+1}, \ldots, X_N$ given $X_1, \ldots, X_k$ in
place of the distribution of $X$ in Theorem 1.48, one obtains the conditional
distribution of each subset of $X_{k+1}, \ldots, X_N$ given $X_1, \ldots, X_k$.

Proposition 1.75. Suppose that $X = (X_1, \ldots, X_N)$ are exchangeable, that
$k + n \le N$, and that $j_1, \ldots, j_n \in \{k+1, \ldots, N\}$ are distinct. The conditional
joint distribution of $X_{k+j_1}, \ldots, X_{k+j_n}$ given $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$
and $\mathbf{P}_N = P$ is that of a simple random sample of size $n$ without
replacement from the distribution $P$ with one ball for each of $x_1, \ldots, x_k$ removed
first.

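Example 1.76. Suppose that $X_1,\ldots,X_N$ are exchangeable Bernoulli random variables. For simplicity, suppose that we are interested in the first $k+n$ $X_i$s. If we will observe $X_1,\ldots,X_k$, then it suffices to be able to calculate $\Pr(Y = \ell \mid Y^* = \ell^*)$ for each $\ell, \ell^*$, where $Y = \sum_{i=k+1}^{k+n} X_i$ and $Y^* = \sum_{i=1}^{k} X_i$. It follows from the exchangeability that

$$\begin{aligned}
\Pr(Y=\ell \mid Y^*=\ell^*) &= \frac{\Pr(Y^*=\ell^*,\, Y=\ell)}{\Pr(Y^*=\ell^*)} \\
&= \binom{k}{\ell^*}\binom{n}{\ell}\frac{\Pr(X_1=1,\ldots,X_{\ell^*+\ell}=1,\, X_{\ell^*+\ell+1}=0,\ldots,X_{k+n}=0)}{\Pr(Y^*=\ell^*)} \\
&= \sum_{m=\ell^*+\ell}^{N-k-n+\ell^*+\ell} \frac{\binom{k}{\ell^*}\binom{N-k}{m-\ell^*}}{\binom{N}{m}}\,\frac{\binom{n}{\ell}\binom{N-k-n}{m-\ell^*-\ell}}{\binom{N-k}{m-\ell^*}}\,\frac{p_m}{\Pr(Y^*=\ell^*)} \\
&= \sum_{m^*=\ell}^{N^*-n+\ell} \frac{\binom{n}{\ell}\binom{N^*-n}{m^*-\ell}}{\binom{N^*}{m^*}}\, p^*_{m^*},
\end{aligned}$$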



This section may be skipped without interrupting the flow of ideas. 



1.5. Proofs of DeFinetti's Theorem 



43 



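where we have made the substitutions

$$N^* = N - k, \qquad m^* = m - \ell^*, \qquad p^*_{m^*} = \frac{\binom{k}{\ell^*}\binom{N-k}{m-\ell^*}\, p_m}{\binom{N}{m}\Pr(Y^*=\ell^*)}.$$

Here $p_m$ denotes $\Pr(M = m)$, where $M$ is the number of successes among all $N$ trials. The conditional probability that $Y = \ell$ given $Y^* = \ell^*$ is in precisely the same form as the marginal probability of $Y = \ell$, except that the distribution of $M$ has been replaced by the conditional distribution of $M^* = M - \ell^*$ given $Y^* = \ell^*$. 28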
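For example, suppose that $p_m = 1/(N+1)$ for $m = 0,\ldots,N$. Then, the probability of one success in one trial is

$$\sum_{m=1}^{N} \frac{\binom{1}{1}\binom{N-1}{m-1}}{\binom{N}{m}}\,\frac{1}{N+1} = \frac{1}{N+1}\sum_{m=1}^{N}\frac{m}{N} = \frac{N(N+1)}{2(N+1)N} = \frac{1}{2}.$$

After seeing one observation $X_1 = 1$, we calculate the conditional distribution of the number of remaining successes,

$$p^*_{m^*} = \frac{\binom{1}{1}\binom{N-1}{m^*}}{\binom{N}{m^*+1}}\,\frac{1/(N+1)}{1/2} = \frac{2(m^*+1)}{N(N+1)},$$

which is less than $1/N$ if $m^* < (N-1)/2$ and greater than $1/N$ if $m^* > (N-1)/2$. The probability has been shifted to higher values of $m^*$ after seeing one success. Suppose now that we see two observations $X_1 = 1$ and $X_2 = 0$. The probability of this (counted as $Y^* = 1$ with $k = 2$, that is, one success in two trials) is

$$\sum_{m=1}^{N-1} \frac{\binom{2}{1}\binom{N-2}{m-1}}{\binom{N}{m}}\,\frac{1}{N+1} = \frac{1}{N+1}\sum_{m=1}^{N-1}\frac{2m(N-m)}{N(N-1)} = \frac{2}{(N-1)N(N+1)}\left[\frac{N^2(N-1)}{2} - \frac{N(N-1)(2N-1)}{6}\right] = \frac{1}{3},$$

and the conditional distribution of the future is

$$p^*_{m^*} = \frac{\binom{2}{1}\binom{N-2}{m^*}}{\binom{N}{m^*+1}}\,\frac{1/(N+1)}{1/3} = \frac{6(m^*+1)(N-m^*-1)}{N(N-1)(N+1)}.$$

The maximum of this probability occurs at the value of $m^*$ closest to $(N-2)/2$, and the probability drops off as $m^*$ moves away from $(N-2)/2$.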
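The calculations in Example 1.76 can be checked with exact arithmetic. The following is a minimal sketch, not part of the original text: the function names `prob_past` and `posterior_remaining` are hypothetical, `p[m]` is assumed to hold $\Pr(M = m)$, and $N = 10$ is an arbitrary illustrative choice.

```python
from fractions import Fraction
from math import comb

def prob_past(N, k, l_star, p):
    """Pr(Y* = l_star): l_star successes among the first k of N exchangeable trials."""
    return sum(
        Fraction(comb(k, l_star) * comb(N - k, m - l_star), comb(N, m)) * p[m]
        for m in range(l_star, N - k + l_star + 1)
    )

def posterior_remaining(N, k, l_star, p):
    """Conditional distribution of M* = M - l_star given Y* = l_star (list indexed by m*)."""
    denom = prob_past(N, k, l_star, p)
    return [
        Fraction(comb(k, l_star) * comb(N - k, m_star), comb(N, m_star + l_star))
        * p[m_star + l_star] / denom
        for m_star in range(N - k + 1)
    ]

N = 10  # illustrative population size (an assumption, not from the text)
uniform = [Fraction(1, N + 1)] * (N + 1)  # p_m = 1/(N+1) for m = 0..N

print(prob_past(N, 1, 1, uniform))  # 1/2
print(prob_past(N, 2, 1, uniform))  # 1/3
print(posterior_remaining(N, 1, 1, uniform)[:3])
```

Using `Fraction` keeps every quantity exact, so the closed forms $2(m^*+1)/[N(N+1)]$ and $6(m^*+1)(N-m^*-1)/[N(N-1)(N+1)]$ can be compared by equality rather than by tolerance.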

We can also prove an approximation theorem that applies to the calculation of conditional probabilities. The theorem says that if we pretend that the $X_i$ are conditionally IID given $\mathbf{P}_N$, and the probability of repeats is 0, then the conditional probabilities we calculate for future observations are uniformly close to the correct conditional probabilities.

28 The readers should convince themselves that $p^*_{m^*}$ is indeed equal to $\Pr(M^* = m^* \mid Y^* = \ell^*)$.

Theorem 1.77. Suppose that $X = (X_1,\ldots,X_N)$ are exchangeable and that $\Pr(X_i = X_j) = 0$ for $i \ne j$. Let $\mathbf{P}_N$ be the empirical distribution of $X$. Let $Y_1,\ldots,Y_N$ be conditionally IID with distribution $P$ given $\mathbf{P}_N = P$. Then, for $n \le N - k$, and $B \in \mathcal{B}^n$,

$$\left|\Pr((X_{k+1},\ldots,X_{k+n}) \in B \mid X_1 = x_1,\ldots,X_k = x_k) - \Pr((Y_{k+1},\ldots,Y_{k+n}) \in B \mid Y_1 = x_1,\ldots,Y_k = x_k)\right|$$



2(N - k)

with respect to the distribution of $X$.

PROOF. First, we prove that the conditional distribution of $X$ given $Y_1,\ldots,Y_k$ is the same as that of $X$ given $X_1,\ldots,X_k$. Call the latter $Q_{X|X_1,\ldots,X_k}$. Let $M_n$ and $H_n$ be as in Lemma 1.67. Let $B \in \mathcal{B}^k$ be such that all points in $B$ have $k$ distinct coordinates. Then

$$M_k(B|x) = k!\binom{N}{k} H_k(B|x)/N^k$$

for all $x$ with distinct coordinates. For each such $B$, Lemma 1.67 says that

$$\Pr((Y_1,\ldots,Y_k) \in B,\, X \in C) = \frac{k!\binom{N}{k}}{N^k}\Pr((X_1,\ldots,X_k) \in B,\, X \in C).$$

In particular, if $C$ is the set of all $x$ with distinct coordinates, then $\Pr(X \in C) = 1$ and

$$\Pr((Y_1,\ldots,Y_k) \in B) = \frac{k!\binom{N}{k}}{N^k}\Pr((X_1,\ldots,X_k) \in B).$$

Let $Q_{X_1,\ldots,X_k}$ and $Q_{Y_1,\ldots,Y_k}$ stand for the joint distributions of $(X_1,\ldots,X_k)$ and $(Y_1,\ldots,Y_k)$ respectively. It follows that for all integrable functions $f$ and all $B \in \mathcal{B}^k$ such that all points have distinct coordinates:

$$\int_B f(x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k) \qquad (1.78)$$

$$= \frac{k!\binom{N}{k}}{N^k}\int_B f(x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k). \qquad (1.79)$$

From the definition of conditional distribution, we have

$$\Pr((X_1,\ldots,X_k) \in B,\, X \in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k).$$

If we set $f(x_1,\ldots,x_k) = Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)$ in (1.78), we get

$$\Pr((Y_1,\ldots,Y_k) \in B,\, X \in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k),$$






and the two conditional distributions are indeed the same for $(x_1,\ldots,x_k)$ vectors with distinct coordinates. Since such vectors have probability 1 under the distribution of $X$, the two conditional distributions are the same a.s.

Now, we apply Lemma 1.67 to both the conditional distributions given $X_1,\ldots,X_k$ and given $Y_1,\ldots,Y_k$. Let $B \in \mathcal{B}^n$ have all distinct coordinates, and let $C$ be the set of all $x \in \mathcal{X}^N$ with distinct coordinates. Then, we get

$$\Pr((X_{k+1},\ldots,X_{k+n}) \in B \mid X_1 = x_1,\ldots,X_k = x_k) = \int_C H_n(B|x_{k+1},\ldots,x_N)\,dQ_{X|X_1,\ldots,X_k}(x|x_1,\ldots,x_k),$$

$$\Pr((Y_{k+1},\ldots,Y_{k+n}) \in B \mid Y_1 = x_1,\ldots,Y_k = x_k) = \int_C M_n(B|x)\,dQ_{X|X_1,\ldots,X_k}(x|x_1,\ldots,x_k).$$

Lemma 1.69 gives a bound on

$$|H_n(B|x_{k+1},\ldots,x_N) - M_n(B|x_{k+1},\ldots,x_N)|.$$

We must now bound the difference $|M_n(B|x_{k+1},\ldots,x_N) - M_n(B|x)|$. As in the proof of Lemma 1.69, one of these probabilities is constant on $\{x_1,\ldots,x_N\}^n$ and the other is constant on a subset and is 0 elsewhere. The sets $B$ with the largest difference will be the set $A$ where the second probability is positive and its complement. Since $M_n(A|x_{k+1},\ldots,x_N) = 1$ and $M_n(A|x) = (1 - k/N)^n$, we get

$$|M_n(B|x) - M_n(B|x_{k+1},\ldots,x_N)| \le 1 - \left(1 - \frac{k}{N}\right)^n.$$

The conclusion to the theorem now follows. □

Example 1.80. If $N = 1{,}000{,}000$, $n = 100$, and $k = 100$, then the bound in Theorem 1.77 is 0.0199, or about 2%. On the other hand, if $N = 1{,}000{,}000$, $n = 1000$, and $k = 1000$, then the bound is a useless 1.632.
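The two numbers in Example 1.80 can be reproduced numerically. The sketch below is an illustration under a stated assumption: it takes the bound to be the sum of a sampling term $n(n-1)/(N-k)$ and the term $1 - (1 - k/N)^n$ that appears at the end of the proof. That form is assumed here because it reproduces both quoted values; the function name `assumed_bound` is hypothetical.

```python
def assumed_bound(N, n, k):
    """Assumed form of the Theorem 1.77 bound (an assumption, chosen to match Example 1.80)."""
    sampling_term = n * (n - 1) / (N - k)    # with- versus without-replacement sampling
    removal_term = 1 - (1 - k / N) ** n      # effect of removing the k observed values
    return sampling_term + removal_term

print(round(assumed_bound(1_000_000, 100, 100), 4))    # 0.0199
print(round(assumed_bound(1_000_000, 1000, 1000), 3))  # 1.632
```

The second case shows why the approximation is only useful when both $n$ and $k$ are small relative to $\sqrt{N}$: the sampling term alone already reaches 1.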



1.5.4 The General Infinite Case 
1.5.4.1 Approximation by the Finite Case* 

Theorem 1.70 says that the probabilities concerning $n$ random quantities calculated under the finite exchangeable distribution of $X_1,\ldots,X_N$ are uniformly approximated by those calculated under a conditionally IID distribution. As $N \to \infty$, one would expect that the joint distribution of






$X_1,\ldots,X_n$ would actually become that of conditionally IID random quantities. In examining the statement of Theorem 1.70, we note that the first term inside the absolute value in (1.71) does not depend on $N$, but $\mu_N$ clearly does depend on $N$. If we could show that there exists $\mu_{\mathbf{P}}$ and a subsequence $\{N_\ell\}_{\ell=1}^{\infty}$ such that, for all $n$ and all $B \in \mathcal{B}^n$,

$$\lim_{\ell\to\infty}\int P^n(B)\,d\mu_{N_\ell}(P) = \int P^n(B)\,d\mu_{\mathbf{P}}(P), \qquad (1.81)$$

then we would have a representation theorem for infinite exchangeable sequences. 29 We will prove that this indeed is true in this section. In fact, (1.81) would follow from the continuous mapping theorem B.88 if we could prove that $P^n(B)$ is a bounded continuous function of $P$ and that $\mathbf{P}_{N_\ell}$ converges in distribution to a random quantity with distribution $\mu_{\mathbf{P}}$. 30 In the case of an infinite sequence of exchangeable Bernoulli random variables, we could easily prove these facts. 31

Example 1.82. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of exchangeable Bernoulli random variables. Then $\mathbf{P}_N$ is nothing more than the proportion of successes, $\bar{X}_N$, in the first $N$ trials. (That is, $\mathbf{P}_N(\{1\}) = \bar{X}_N$ and $\mathbf{P}_N(\{0\}) = 1 - \bar{X}_N$.) The strong law of large numbers 1.59 or 1.62 will say that a subsequence $\bar{X}_{n_k}$ converges a.s. (hence in distribution by Theorem B.90) to something, call it $\Theta$. Then $\mathbf{P}(\{1\}) = \Theta$, and $P^n(B)$ is a bounded continuous function of $\Theta$ ($P^n(B) = \sum_{y \in B} \Theta^{t(y)}(1-\Theta)^{n-t(y)}$, where $t(y) = \sum_{i=1}^{n} y_i$). It follows that $\{X_n\}_{n=1}^{\infty}$ is a sequence of exchangeable Bernoulli random variables if and only if there exists a distribution $\mu_\Theta$ such that for all $n$, all distinct $i_1,\ldots,i_n$, and all $x_1,\ldots,x_n \in \{0,1\}$,

$$\Pr(X_{i_1} = x_1,\ldots,X_{i_n} = x_n) = \int \theta^k(1-\theta)^{n-k}\,d\mu_\Theta(\theta),$$

where $k = \sum_{i=1}^{n} x_i$. This is the representation portion of Theorem 1.47.
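The representation in Example 1.82 is easy to evaluate when $\mu_\Theta$ is discrete. The following sketch is only an illustration: the two-atom mixing distribution `atoms` and the function name `mixture_prob` are hypothetical choices, not taken from the text.

```python
from itertools import permutations, product

def mixture_prob(x, atoms):
    """Pr(X_1=x_1,...,X_n=x_n) = sum over atoms of w * theta^k * (1-theta)^(n-k)."""
    k, n = sum(x), len(x)
    return sum(w * theta**k * (1 - theta) ** (n - k) for theta, w in atoms)

# Hypothetical mixing distribution: theta = 0.2 or 0.7, each with weight 1/2.
atoms = [(0.2, 0.5), (0.7, 0.5)]

# The probabilities of all 0/1 sequences of length 3 add up to one.
total = sum(mixture_prob(x, atoms) for x in product((0, 1), repeat=3))

# Exchangeability: every rearrangement of (1, 1, 0) has the same probability.
rearranged = {round(mixture_prob(p, atoms), 12) for p in permutations((1, 1, 0))}

print(round(total, 12), len(rearranged))
print(mixture_prob((1, 1), atoms), mixture_prob((1,), atoms) ** 2)
```

The last line compares $\Pr(X_1 = 1, X_2 = 1)$ with $\Pr(X_1 = 1)^2$; they differ, which shows that the $X_i$ are exchangeable but not independent unless $\mu_\Theta$ is a point mass.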

29 Since the first term on the left-hand side of (1.71) does not depend on $N$, the limit in (1.81) must be the same for all convergent subsequences.

30 Diaconis and Freedman (1980a) offer a sketch of an abstract proof showing directly that $\mathbf{P}_N$ converges in distribution for general random quantities. We will not actually prove that $\mathbf{P}_N$ converges in distribution to $\mathbf{P}$. Rather, we prove that the finite-dimensional joint distributions of $\mathbf{P}_N$ converge to those of $\mathbf{P}$. Billingsley (1968), which contains an in-depth discussion of convergence in distribution, shows that convergence in distribution requires a condition called tightness in addition to convergence of finite-dimensional joint distributions. The work of the tightness condition is done in that part of the proof of Theorem 1.49 in which we prove equation (1.84). An alternative proof is given by Aldous (1985, Section 7). A very general theorem is proven by Hewitt and Savage (1955).

31 Heath and Sudderth (1976) give an alternative proof in the Bernoulli case which relies on a different subsequence argument.

In general, if we wish to show that the $X_i$ are actually conditionally IID given some random probability measure $\mathbf{P}$, we will need to prove more than just (1.81). We will need to show that for every measurable subset $C$ of $\mathcal{P}$,

$$\Pr((X_{i_1},\ldots,X_{i_n}) \in B,\, \mathbf{P} \in C) = \int_C P^n(B)\,d\mu_{\mathbf{P}}(P). \qquad (1.83)$$

In effect, we need to prove that P exists and that 
PMB)/c(P)-PWc(P) 

for all B and C. 

The distribution $\mu_N$ in Theorem 1.70 was seen to be the distribution of the empirical probability measure $\mathbf{P}_N$ of $X = (X_1,\ldots,X_N)$. If (1.83) is going to hold, it would stand to reason that $\mu_{\mathbf{P}}$ would equal the distribution of the limit of $\mathbf{P}_N$ as $N \to \infty$, if the limit exists. That this limit exists will follow from the strong law of large numbers 1.59 or 1.62. We will use this fact to prove that (1.83) holds.

1.5.4.2 Proof of Theorem 1.49 

The proof of Theorem 1.49 is a bit complicated, so a brief outline will be given first. We use the strong law of large numbers to conclude that (at least a subsequence of) the empirical probability measures at each set $B$, $\{\mathbf{P}_n(B)\}_{n=1}^{\infty}$, converges to something, which we call $\mathbf{P}(B)$. To show that $\mathbf{P}(\cdot)$ is a random probability measure, we show that $\mathbf{P}(B) = \Pr(X_1 \in B|\mathcal{A}_\infty)$ for some σ-field $\mathcal{A}_\infty$. Since $\mathcal{X}$ is a Borel space, there is a regular conditional distribution of which $\mathbf{P}(B)$ is a version for all $B$. It is easy to show that $\mathbf{P} : S \to \mathcal{P}$ is a measurable function. The same calculation that lets us prove that $\mathbf{P}(B) = \Pr(X_1 \in B|\mathcal{A}_\infty)$, namely equation (1.84) in the proof, also leads to the conclusion that the $X_i$ are conditionally IID.

Proof of Theorem 1.49. The "if" direction is straightforward, and its proof is left to the reader. (See Problem 4 on page 73.) For the "only if" direction, assume that $\{X_n\}_{n=1}^{\infty}$ are exchangeable. Let $\mathbf{P}_n$ be the empirical distribution of $X_1,\ldots,X_n$. For each $B \in \mathcal{B}$, $\lim_{n\to\infty} \mathbf{P}_{8^n}(B) = \mathbf{P}(B)$ exists a.s., as either Theorem 1.59 or 1.62 says.

For each $B \in \mathcal{B}$ and each integer $i$, $\mathbf{P}(B) = \lim_{n\to\infty} \sum_{m=i}^{8^n} I_B(X_m)/8^n$, a.s. It follows that for each $i$, $\mathbf{P}(B)$ is measurable with respect to the σ-field generated by $\{X_n\}_{n=i}^{\infty}$. Hence, it is measurable with respect to the intersection (over $i$) of all of these σ-fields, which is the tail σ-field of $\{X_n\}_{n=1}^{\infty}$. Call the tail σ-field $\mathcal{A}_\infty$.

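We next prove that for every $k$, all distinct $i_1,\ldots,i_k$, all $B_1,\ldots,B_k \in \mathcal{B}$ and every $C \in \mathcal{A}_\infty$,

$$\Pr(\{X_{i_1} \in B_1,\ldots,X_{i_k} \in B_k\} \cap C) = \int_C \prod_{j=1}^{k} \mathbf{P}(B_j)(s)\,d\mu(s). \qquad (1.84)$$

To do this, let $Z_n = I_C \prod_{j=1}^{k} \mathbf{P}_{8^n}(B_j)$ and $Z = I_C \prod_{j=1}^{k} \mathbf{P}(B_j)$. Note that $Z_n \to Z$, a.s. as $n \to \infty$, hence $Z_n \xrightarrow{\mathcal{D}} Z$ by Theorem B.90. Since $Z_n$ is uniformly bounded, $E(Z_n) \to E(Z)$. Since $\mathbf{P}(B_j)$ is measurable with respect to $\mathcal{A}_\infty$, the integral on the right-hand side of (1.84) is $E(Z)$. All that remains to the proof of (1.84) is to show that $E(Z_n)$ converges to the left-hand side of (1.84). To do this, let $m = 8^n$, and write

$$I_C \prod_{j=1}^{k} \mathbf{P}_m(B_j) = I_C \frac{1}{m^k} \sum_{\ell_1=1}^{m} \cdots \sum_{\ell_k=1}^{m} \prod_{j=1}^{k} I_{B_j}(X_{\ell_j}) = I_C \frac{1}{m^k}\left[\sum_{\text{all } \ell_j \text{ distinct}} \prod_{j=1}^{k} I_{B_j}(X_{\ell_j}) + \sum_{\text{at least two } \ell_j \text{ equal}} \prod_{j=1}^{k} I_{B_j}(X_{\ell_j})\right].$$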

In the last expression above, the first sum has $m!/(m-k)!$ terms, each of which, multiplied by $I_C$, has mean equal to the left-hand side of (1.84). Since $m!/[(m-k)!\,m^k]$ converges to 1, the mean of $1/m^k$ times this first sum converges to the left-hand side of (1.84). The second sum has $m^k - m!/(m-k)!$ terms, each of which is bounded between 0 and 1. Since $1 - m!/[(m-k)!\,m^k]$ converges to 0, so does the mean of $1/m^k$ times the second sum. This completes the proof of (1.84).
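The counting step can be illustrated numerically. This is a small sketch, not from the original text; the function name `distinct_fraction` is hypothetical. It computes $m!/[(m-k)!\,m^k]$, the fraction of the $m^k$ index tuples whose entries are all distinct, along the subsequence $m = 8^n$.

```python
from math import prod

def distinct_fraction(m, k):
    """Fraction m!/((m-k)! m^k) of index tuples (l_1,...,l_k) in {1..m}^k with distinct entries."""
    return prod((m - j) / m for j in range(k))

k = 3  # illustrative number of sets B_1..B_k
for n in range(1, 7):
    m = 8 ** n
    print(n, m, distinct_fraction(m, k))
```

The printed fractions increase toward 1, so the contribution of tuples with repeated indices vanishes in the limit, as the proof requires.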

Apply (1.84) with $k = 1$, $i_1 = 1$, and $B_1 = B$ to get that $E(I_B(X_1) I_C) = E(\mathbf{P}(B) I_C)$. This means that $\mathbf{P}(B)$ is a version of $\Pr(X_1 \in B|\mathcal{A}_\infty)$ for every $B$. Since $(\mathcal{X}, \mathcal{B})$ is a Borel space, this can be assumed to be part of a regular conditional distribution, and we can assume that $\mathbf{P}(B) = \Pr(X_1 \in B|\mathcal{A}_\infty)$. In this way $\mathbf{P}$ becomes a random probability measure so long as we can prove that it is a measurable function from $(S, \mathcal{A})$ to $(\mathcal{P}, \mathcal{C}_{\mathcal{P}})$. The σ-field $\mathcal{C}_{\mathcal{P}}$ was set up so that $\mathbf{P}$ is measurable if and only if $\mathbf{P}(B)$ is measurable for all $B$. Since $\mathcal{A}_\infty \subseteq \mathcal{A}$, $\mathbf{P}(B)$ is measurable for all $B$. Also, since $\Pr(X_1 \in B|\mathcal{A}_\infty) = \mathbf{P}(B)$ is a function of $\mathbf{P}$ for each $B$ and since $\mathbf{P}$ is $\mathcal{A}_\infty$ measurable, it follows from Theorem B.73 that $\Pr(X_1 \in B|\mathbf{P}) = \mathbf{P}(B)$.

Now, let $\mu_{\mathbf{P}}$ denote the distribution of $\mathbf{P}$. To prove that the $X_i$ are conditionally independent given $\mathbf{P} = P$ with distribution $P$, apply (1.84) with $C = \{\mathbf{P} \in A\}$ for arbitrary $A \in \mathcal{C}_{\mathcal{P}}$. The result is

$$\Pr(X_{i_1} \in B_1,\ldots,X_{i_k} \in B_k,\, \mathbf{P} \in A) = \int_{\{\mathbf{P} \in A\}} \prod_{j=1}^{k} \mathbf{P}(B_j)(s)\,d\mu(s)$$

$$= \int_A \prod_{j=1}^{k} P(B_j)\,d\mu_{\mathbf{P}}(P) = \int_A \prod_{j=1}^{k} \Pr(X_{i_j} \in B_j \mid \mathbf{P} = P)\,d\mu_{\mathbf{P}}(P),$$

where the first equation is immediate from (1.84), the second follows from Theorem A.81, and the third follows from the fact that $\Pr(X_i \in B \mid \mathbf{P} = P) = P(B)$. This completes the proof of conditional independence given $\mathbf{P}$.

For the uniqueness, suppose that $\mu_1$ and $\mu_2$ are possible distributions for a random probability measure $\mathbf{P}$ such that the $X_i$ are conditionally IID with distribution $P$ given $\mathbf{P} = P$. We will prove that the finite-dimensional distributions of $\mu_1$ and $\mu_2$ agree, and then Theorem B.131 says that $\mu_1 = \mu_2$. Let $B_1,\ldots,B_n \in \mathcal{B}$, and let $k_1,\ldots,k_n$ be positive integers. We have already proven that

$$\Pr(X_1 \in B_1,\ldots,X_{k_1} \in B_1,\, X_{k_1+1} \in B_2,\ldots,X_{k_1+k_2} \in B_2,\,\ldots,\,X_{k_1+\cdots+k_n} \in B_n) = \int \prod_{i=1}^{n} P(B_i)^{k_i}\,d\mu_1(P) = \int \prod_{i=1}^{n} P(B_i)^{k_i}\,d\mu_2(P).$$

Hence, the means of all polynomial functions of $(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n))$ are the same according to $\mu_1$ and $\mu_2$. By the Stone-Weierstrass theorem C.3 the means of all bounded continuous functions of finitely many $\mathbf{P}(B)$ values are determined by the means of polynomial functions. Hence, the means of all bounded continuous functions of $(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n))$ are the same according to $\mu_1$ and $\mu_2$. Corollary B.107 says that $\mu_1$ and $\mu_2$ give the same joint distribution to $(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n))$, and the proof of uniqueness is complete.

The convergence claim follows from Theorem 1.62 or from Lemma 1.61 and the fact that the bounded random variables $I_B(X_i)$ are conditionally IID. □



1.5.5 Formal Introduction to Parametric Models* 

The infinite version of DeFinetti's representation theorem 1.49 says that if an infinite sequence of random quantities is exchangeable, then specifying the joint distribution of all of them can be done by specifying a distribution for the limit of the empirical probability measures. Every probability measure is a limit of empirical probability measures, and so the space $\mathcal{P}$, on which the distribution must be placed, is quite large. There are (at least) two problems involved in specifying a distribution over $\mathcal{P}$:

1. How do you perform the general integration $\int h(P)\,d\mu_{\mathbf{P}}(P)$?

2. How do you get the conditional distribution of $\mathbf{P}$ given data? 32

These two problems are related, and the usual method of solving them is to say that $\mu_{\mathbf{P}}$ assigns probability 1 to a relatively small subset of $\mathcal{P}$. In Example 1.53 on page 29, we saw a case in which $\mu_{\mathbf{P}}$ assigned all of its probability to the set of exponential distributions. An alternative is to assign all probability to normal distributions or $t$ distributions, and so forth.




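32 We have not tried to prove that $(\mathcal{P}, \mathcal{C}_{\mathcal{P}})$ is a Borel space. However, since $(\mathcal{X}^\infty, \mathcal{B}^\infty)$ is a Borel space and $\mathbf{P} : \mathcal{X}^\infty \to \mathcal{P}$ is measurable, regular conditional distributions exist on $\mathcal{X}^\infty$ and they induce regular conditional distributions on $\mathcal{P}$.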






These cases all have something in common, namely that the set of distributions is finitely parameterized. That is, there exists a one-to-one mapping between the set of distributions and a subset of a finite-dimensional Euclidean space. In the normal example, the mapping associates the $N(m, s^2)$ distribution with $(m, s) \in \mathbb{R} \times \mathbb{R}^+$. With such a parameter mapping, we can switch the problem of integration over subsets of $\mathcal{P}$ to integration over Euclidean space. The problem of finding conditional distributions is resolved the same way. The conditional distribution in Euclidean space induces the appropriate conditional distribution in $\mathcal{P}$. (See Theorem B.28 on page 617.) There are cases (see Sections 1.6.1 and 1.6.2) in which we want the range of the parameter mapping to be an infinite-dimensional space. In such cases, we will need to develop special methods for calculating integrals.

Now, let $\mathcal{P}_0$ be a subset of $\mathcal{P}$ and let $\Theta' : \mathcal{P}_0 \to \Omega$ be a bimeasurable function, where $\Omega$ is a set with σ-field $\tau$. The σ-field of subsets of $\mathcal{P}_0$ which we need to consider is $\mathcal{C}_0 = \{A \cap \mathcal{P}_0 : A \in \mathcal{C}_{\mathcal{P}}\}$. Let $\mu_{\mathbf{P}}$ be a probability measure on $(\mathcal{P}_0, \mathcal{C}_0)$ and let $\mu_\Theta$ be the probability on $(\Omega, \tau)$ induced by $\Theta'$ from $\mu_{\mathbf{P}}$. That is, for each $A \in \mathcal{C}_0$, $\mu_{\mathbf{P}}(A) = \mu_\Theta(\Theta'(A))$, and for each $B \in \tau$, $\mu_\Theta(B) = \mu_{\mathbf{P}}(\Theta'^{-1}(B))$. To integrate a measurable function $h : \mathcal{P}_0 \to \mathbb{R}$, we note that

$$\int h(P)\,d\mu_{\mathbf{P}}(P) = \int h(\Theta'^{-1}(\theta))\,d\mu_\Theta(\theta),$$

where $\theta$ is used to stand for an arbitrary element of $\Omega$ and $P = \Theta'^{-1}(\theta) \in \mathcal{P}_0$. For example, if $\mathcal{X} = \mathbb{R}$ and $P$ is the $N(m, s^2)$ distribution, and $\Theta'(P) = (m, s)$, then for $\theta = (m, s)$, $\Theta'^{-1}(\theta) = P$. As another example, let $\mathcal{X} = \{0, 1, 2\}$. In this case $\mathcal{P}$ is already a finite-dimensional set. We can let

$$\Omega = \{(p_0, p_1, p_2) : p_i \ge 0,\ p_0 + p_1 + p_2 = 1\}.$$

If $P$ is the distribution that says $P(\{i\}) = p_i$ for $i = 0, 1, 2$ with $p_0 + p_1 + p_2 = 1$ and $p_i \ge 0$ for all $i$, then we can let $\Theta'(P) = (p_0, p_1, p_2)$. In this case $\mathcal{P}_0 = \mathcal{P}$.

In general, we can make the above discussion precise as follows. Let $(S, \mathcal{A}, \mu)$ be a probability space and let $(\mathcal{X}^1, \mathcal{B}^1)$ be a Borel space. Suppose that $\{X_n\}_{n=1}^{\infty}$ is an infinite sequence of exchangeable random quantities, $X_n : S \to \mathcal{X}^1$. Let $\mathcal{X}^n$ and $\mathcal{X}^\infty$ be finite and infinite products of $\mathcal{X}^1$ with σ-fields $\mathcal{B}^n$ and $\mathcal{B}^\infty$, respectively. Let $X^\infty$ denote the function mapping $S$ to $\mathcal{X}^\infty$ by $X^\infty(s) = (X_1(s), X_2(s), \ldots)$. Similarly, let $X^n : S \to \mathcal{X}^n$ be $X^n(s) = (X_1(s),\ldots,X_n(s))$. Let $\mathbf{P} : \mathcal{X}^\infty \to \mathcal{P}$ denote the "limit of empirical probabilities" function. Next we introduce a general definition.

Definition 1.85. A bimeasurable mapping $\Theta'$ from a subset $\mathcal{P}_0$ of $\mathcal{P}$ to a subset $\Omega$ of some Borel space with σ-field $\tau$ is called a parametric index. The parametric index is denoted by $\Theta' : \mathcal{P}_0 \to \Omega$. The set $\Omega$ is called the parameter space, and the set $\mathcal{P}_0$ is called the parametric family.






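Let $\Theta' : \mathcal{P}_0 \to \Omega$ denote a parametric index. We have constructed the following sequence of functions:

$$S \xrightarrow{\;X^\infty\;} \mathcal{X}^\infty \xrightarrow{\;\mathbf{P}\;} \mathcal{P} \xrightarrow{\;\Theta'\;} \Omega.$$

Let the function $\Theta : S \to \Omega$ be defined by $\Theta(s) = \Theta'(\mathbf{P}(X^\infty(s)))$. (Note that the value of $\Theta$ is the same as the value of $\Theta'$, hence we will often find it convenient to use the symbol $\Theta$ to refer to both $\Theta$ and $\Theta'$.) We call $\Theta$ the parameter. Let $\mu_\Theta$ be the probability induced on $(\Omega, \tau)$ by $\Theta$ from $\mu$. Let $\mathcal{A}_X$ be the sub-σ-field of $\mathcal{A}$ generated by $X^\infty$. Since $(\mathcal{X}^\infty, \mathcal{B}^\infty)$ is also a Borel space, regular conditional distributions given $\Theta$ exist. For each $A \in \mathcal{A}_X$, let $P'_\theta(A) = \Pr(A|\Theta)(s)$ for all $s$ such that $\Theta(s) = \theta$. For each $B \in \mathcal{B}^\infty$, let $P_\theta(B) = P'_\theta((X^\infty)^{-1}(B))$. In words, $\{P_\theta : \theta \in \Omega\}$ specifies the conditional distribution of $X^\infty$ given $\Theta$.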

Example 1.86. Let $\mathcal{X}^1 = \mathbb{R}$ and let $\mathcal{P}_0$ be the set of all normal distributions. Assume that $\mu_{\mathbf{P}}$ assigns probability 1 to the set $\mathcal{P}_0$. We can let $\Theta'(P)$ be the vector consisting of the mean and standard deviation of the normal distribution $P$. Then $\Theta(s)$ is the vector consisting of the mean and standard deviation of the limit of the empirical distribution of a sequence $\{X_n\}_{n=1}^{\infty}$ of exchangeable random variables. By the strong law of large numbers 1.63 and the fact that the $X_n$ are conditionally IID given $\mathbf{P}$, $\Theta(s)$ is also the limit (a.s.) of the sample average $\bar{X}_n = \sum_{i=1}^{n} X_i/n$ and the sample standard deviation $\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2/(n-1)}$ of the data sequence. If $\theta = (\mu, \sigma)$, then $P_\theta$ is the distribution that says that $\{X_n\}_{n=1}^{\infty}$ are IID $N(\mu, \sigma^2)$ random variables. The notation $P'_\theta$ stands for the probability measure on $(S, \mathcal{A}_X)$ defined by $P'_\theta((X^\infty)^{-1}(B)) = P_\theta(B)$ for $B \in \mathcal{B}^\infty$.
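The role of $\Theta$ as a limit of sample statistics in Example 1.86 can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the distribution used to draw $\Theta$ (a standard normal mean and a uniform standard deviation) is a hypothetical choice, as are the function name `theta_estimate` and the sample size.

```python
import random
import statistics

def theta_estimate(xs):
    """Sample average and sample standard deviation (divisor n - 1) of the data."""
    return statistics.fmean(xs), statistics.stdev(xs)

rng = random.Random(0)

# Hypothetical mixing distribution mu_Theta: draw theta = (mu, sigma) first.
mu = rng.gauss(0.0, 1.0)
sigma = rng.uniform(0.5, 2.0)

# Given Theta = (mu, sigma), the data are IID N(mu, sigma^2).
xs = [rng.gauss(mu, sigma) for _ in range(50_000)]

mean_hat, sd_hat = theta_estimate(xs)
print((mu, sigma), (mean_hat, sd_hat))
```

With a long data sequence the pair of sample statistics is close to the drawn value of $\Theta$, which is the sense in which the data sequence determines the parameter.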

The probability measures $P_\theta$ for $\theta \in \Omega$ are on the space $(\mathcal{X}^\infty, \mathcal{B}^\infty)$. They induce probabilities on all of the spaces $(\mathcal{X}^n, \mathcal{B}^n)$, for $n = 1, 2, \ldots$ via the obvious projections. It will prove convenient to refer to all of these induced probabilities by the same name, $P_\theta$. That is, if $A \in \mathcal{B}^n$, let $P_\theta(A)$ denote $P_\theta(A \times \mathcal{X}^1 \times \mathcal{X}^1 \times \cdots)$. This will be very convenient without causing any confusion. If it becomes important to know over which space, $(\mathcal{X}^n, \mathcal{B}^n)$ or $(\mathcal{X}^\infty, \mathcal{B}^\infty)$, $P_\theta$ is defined, we will be explicit.

Sometimes the parameter $\Theta$ can be expressed as a meaningful function of the distribution $P$, say $H(P)$, which is also defined for distributions outside of the parametric family. For example, $H(P) = \int x\,dP(x)$, the mean of the distribution, is defined for every distribution with finite mean whether or not that distribution is a member of a parametric family of interest. When this occurs, it may be that $H$ is continuous in the sense that $H(P_n) \xrightarrow{\mathcal{D}} H(P)$ if $\lim_{n\to\infty} P_n = P$. The distribution of $\Theta$ can then be considered as an approximation to the distribution of $H(\mathbf{P}_n)$, where $\mathbf{P}_n$ is the empirical probability measure of the first $n$ observations.

Example 1.87 (Continuation of Example 1.53; see page 29). In the $Exp(\theta)$ distribution, $\theta$ is one over the mean of the distribution. So, $H(P) = \left(\int x\,dP(x)\right)^{-1}$ and $H(\mathbf{P}_n) = \left(\sum_{i=1}^{n} X_i/n\right)^{-1} = 1/\bar{X}_n$. Indeed, $1/\bar{X}_n \xrightarrow{\mathcal{D}} \Theta$, so that we can take the distribution of $\Theta$ to be an approximation to the distribution of $1/\bar{X}_n$. (See Problem 21 on page 76.)

Example 1.88 (Continuation of Example 1.38; see page 18). The marginal posterior for $M$ can be calculated by integrating $\sigma$ out of (1.25) (after changing the 0 subscripts to 1), or by using the fact that $M$ is the limit as $m$ goes to $\infty$ of $\bar{Y}_m$, so the distribution of $M$ is the limit of the distributions of the $\bar{Y}_m$. Since the $t_{a_1}\left(\mu_1, \sqrt{[1/m + 1/\lambda_1]\,b_1/a_1}\right)$ densities converge to the $t_{a_1}\left(\mu_1, \sqrt{b_1/[a_1\lambda_1]}\right)$ density as $m \to \infty$, Scheffé's theorem B.79 says that this limit is the distribution of $M$.



1.6 Infinite-Dimensional Parameters* 

An alternative to the use of finite-dimensional parameter spaces is to attempt to place a probability distribution on an infinite-dimensional space $\mathcal{P}$. It is common to call such models nonparametric. We will consider two types of probability measures on infinite-dimensional parameter spaces, Dirichlet processes and tailfree processes.

1.6.1 Dirichlet Processes 

Ferguson (1973) gives a probability measure on an infinite-dimensional space $\mathcal{P}$ for which certain calculations are simple. We can think of $\mathbf{P}$ as a stochastic process as in Section B.5 with index set $\mathcal{B}$, a σ-field of subsets of $\mathcal{X}$. To specify a distribution for $\mathbf{P}$, we need to specify the joint distribution of $(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n))$ for all $n$ and all $B_1,\ldots,B_n \in \mathcal{B}$ in such a way that the distributions are consistent according to Definition B.132. One way to do this is as follows. Let $\alpha$ be a finite measure on $(\mathcal{X}, \mathcal{B})$. For each integer $n > 0$ and partition $B_1,\ldots,B_n$ of $\mathcal{X}$, define the joint distribution of $(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n))$ to be the Dirichlet distribution $Dir_n(\alpha_1,\ldots,\alpha_n)$, where $\alpha_i = \alpha(B_i)$ for $i = 1,\ldots,n$. This is a distribution for a vector $(Y_1,\ldots,Y_n)$ such that $\sum_{i=1}^{n} Y_i = 1$ and such that $(Y_1,\ldots,Y_{n-1})$ have joint density

$$\frac{\Gamma(\alpha_1 + \cdots + \alpha_n)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_n)}\, y_1^{\alpha_1 - 1} \cdots y_{n-1}^{\alpha_{n-1} - 1}(1 - y_1 - \cdots - y_{n-1})^{\alpha_n - 1}.$$

The Dirichlet distribution is a multivariate generalization of the Beta distribution. To avoid having to deal separately with the cases in which some of the sets $B_i$ have $\alpha(B_i) = 0$, we will extend the definition of the Dirichlet distribution to allow some of the $\alpha_i$ to be 0. In this case, those coordinates corresponding to $\alpha_i = 0$ are equal to 0 with probability 1 and the rest of the coordinates have the usual Dirichlet distribution.
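The extended Dirichlet distribution can be simulated with the standard representation of a Dirichlet vector as independent Gamma variables divided by their sum. The sketch below is not from the original text; the function name `sample_dirichlet` is hypothetical, and coordinates with $\alpha_i = 0$ are set to exactly 0 to match the extended definition above.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one vector from Dir_n(alphas); coordinates with alpha_i = 0 are exactly 0."""
    if any(a < 0 for a in alphas) or all(a == 0 for a in alphas):
        raise ValueError("alphas must be nonnegative and not all zero")
    # Independent Gamma(alpha_i, 1) variables, normalized to sum to one.
    gammas = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(1)
draw = sample_dirichlet([2.0, 0.0, 3.0], rng)
print(draw, sum(draw))

# Monte Carlo check that E(Y_1) is close to alpha_1 / (alpha_1 + ... + alpha_n) = 0.4.
draws = [sample_dirichlet([2.0, 0.0, 3.0], rng) for _ in range(20_000)]
print(sum(d[0] for d in draws) / len(draws))
```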

We prove next that this specification of the distribution of P is consistent. 





1.6. Infinite-Dimensional Parameters 53 



Theorem 1.89. Let $\mathbf{P}$ be a random probability measure on a Borel space $(\mathcal{X}, \mathcal{B})$, and let $B_1,\ldots,B_n \in \mathcal{B}$ partition $\mathcal{X}$. Let $\alpha$ be a finite measure on $(\mathcal{X}, \mathcal{B})$ with $\alpha(\mathcal{X}) > 0$, and let $\alpha_i = \alpha(B_i)$ for all $i$. To say that

$$(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n)) \sim Dir_n(\alpha_1,\ldots,\alpha_n)$$

specifies a consistent set of distributions in the sense of Definition B.132.

Proof. Let $A_1,\ldots,A_p$ be elements of $\mathcal{B}$. Set $B_1,\ldots,B_n$ equal to the partition consisting of the constituents of $A_1,\ldots,A_p$. That is, each $B_i$ equals one of the $2^p$ sets $C_1 \cap \cdots \cap C_p$, where each $C_i$ is either $A_i$ or $A_i^C$ for $i = 1,\ldots,p$ (e.g., $A_1 \cap A_2^C \cap A_3 \cap \cdots \cap A_p^C$). Let $G_i = \mathbf{P}(A_i)$ for $i = 1,\ldots,p$. We need to show that for each $i = 1,\ldots,p$ and each set of numbers $t_1,\ldots,t_{i-1},t_{i+1},\ldots,t_p$,

$$\lim_{t_i\to\infty} F_{G_1,\ldots,G_p}(t_1,\ldots,t_p) = F_{G_1,\ldots,G_{i-1},G_{i+1},\ldots,G_p}(t_1,\ldots,t_{i-1},t_{i+1},\ldots,t_p). \qquad (1.90)$$

Let $Y_j = \mathbf{P}(B_j)$ for $j = 1,\ldots,n$. Let $c_i = \{j : B_j \subseteq A_i\}$. Then, the expression whose limit is being taken on the left-hand side of (1.90) is

$$\Pr\Big(\sum_{j \in c_i} Y_j \le t_i, \text{ for } i = 1,\ldots,p\Big) = \Pr((Y_1,\ldots,Y_n) \in C(t)), \qquad (1.91)$$

where

$$C(t) = \Big\{(y_1,\ldots,y_n) : \sum_{j \in c_i} y_j \le t_i, \text{ for } i = 1,\ldots,p\Big\}.$$

Now, fix an $i$, and let $B'_1,\ldots,B'_m$ be the constituents of $\{A_j : j \ne i\}$. Each $B'_s$ is the union of two of the $B_j$. (For example, if $i = 1$ and $p = 3$, then $(A_1 \cap A_2 \cap A_3) \cup (A_1^C \cap A_2 \cap A_3)$ is one of the $B'_s$.) Let $Z_s = \mathbf{P}(B'_s)$ for $s = 1,\ldots,m$. The proposed distribution implies that $(Z_1,\ldots,Z_m)$ has $Dir_m(\beta_1,\ldots,\beta_m)$ distribution, where $\beta_s = \alpha_{j_1} + \alpha_{j_2}$ when $B'_s = B_{j_1} \cup B_{j_2}$. For $j \ne i$, let $d_j = \{s : B'_s \subseteq A_j\}$. The limit as $t_i \to \infty$ of the expression in (1.91) is

$$\lim_{t_i\to\infty} \Pr((Y_1,\ldots,Y_n) \in C(t)) = \Pr\Big((Y_1,\ldots,Y_n) \in \bigcup_{t_i \in \mathbb{R}} C(t)\Big) = \Pr\Big(\sum_{s \in d_j} Z_s \le t_j, \text{ for all } j \ne i\Big). \qquad (1.92)$$

It is easy to see that (1.92) is the same as the right-hand side of (1.90). □

Since the distributions specified are consistent, we can use them for the distribution of $\mathbf{P}$.






Definition 1.93. If $\alpha$ is a finite (not identically 0) measure on $(\mathcal{X}, \mathcal{B})$ and $\mathbf{P}$ is a random distribution such that, for each $n$, and each partition $\{B_1,\ldots,B_n\}$ of $\mathcal{X}$,

$$(\mathbf{P}(B_1),\ldots,\mathbf{P}(B_n)) \sim Dir_n(\alpha_1,\ldots,\alpha_n),$$

where $\alpha_i = \alpha(B_i)$ for $i = 1,\ldots,n$, then we say that $\mathbf{P}$ has Dirichlet process distribution with base measure $\alpha$, denoted by $Dir(\alpha)$.

The Dirichlet process is useful only if we can do the necessary calculations for making inference. The most crucial is updating in the light of data.

Theorem 1.94. Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of exchangeable random quantities, that they are conditionally independent with distribution $P$ given $\mathbf{P} = P$, and that $\mathbf{P}$ has $Dir(\alpha)$ distribution. Then the marginal distribution of each $X_i$ is the probability measure $\alpha/\alpha(\mathcal{X})$ and the conditional distribution of $\mathbf{P}$ given $X_1 = x_1,\ldots,X_k = x_k$ is $Dir(\beta)$, where $\beta$ is the measure defined by $\beta(C) = \alpha(C) + \sum_{i=1}^{k} I_C(x_i)$ for each $C$.

PROOF. First, we prove the claim about the marginal distribution of $X_i$. For $B \in \mathcal{B}$,

$$\Pr(X_i \in B) = E(\Pr(X_i \in B|\mathbf{P})) = E(\mathbf{P}(B)) = \frac{\alpha(B)}{\alpha(\mathcal{X})},$$

where the first equality follows from the law of total probability B.70, and the last follows from the fact that each coordinate of a Dirichlet distribution has Beta distribution.

By the form of the purported posterior, it is clear that if we can prove the result for $k = 1$, we can extend it to arbitrary $k$ by induction. Let $\mu_{\mathbf{P}}$ denote the $Dir(\alpha)$ measure on the space of probability measures $\mathcal{P}$. For arbitrary $n$ and partition $B_1,\ldots,B_n$, and arbitrary $B \in \mathcal{B}$ and $t_1,\ldots,t_n$, let $A = \{P : P(B_i) \le t_i, \text{ for } i = 1,\ldots,n\}$. Assume that $\alpha(B) < \alpha(\mathcal{X})$. 33 Define, for $i = 1,\ldots,n$, $B_i^0 = B_i \cap B$ and $B_i^1 = B_i \cap B^C$. Then $B_1^0,\ldots,B_n^0,B_1^1,\ldots,B_n^1$ form a partition of at most $2n$ nonempty sets. 34 In particular, we can write $P(B) = \sum_{i=1}^{n} P(B_i^0)$. Define

$$A_B = \{(z_1,\ldots,z_{2n}) : z_i + z_{n+i} \le t_i, \text{ for } i = 1,\ldots,n\},$$

and note that $P \in A$ if and only if $(P(B_1^0),\ldots,P(B_n^1)) \in A_B$. Let $\beta_i = \alpha(B_i^0)$ for $i = 1,\ldots,n$, and let $\beta_i = \alpha(B_{i-n}^1)$ for $i = n+1,\ldots,2n$. Let $c = \{i : \beta_i \ne 0\}$, and let $k$ be the highest number in $c$. Let $c' = \{i : \beta_i \ne 0\} \setminus \{k\}$. If $\beta_j = 0$, let $z_j = 0$ in the following equations. Then we can write

$$\Pr(\mathbf{P} \in A,\, X_1 \in B) = \Pr(\mathbf{P}(B_1) \le t_1,\ldots,\mathbf{P}(B_n) \le t_n,\, X_1 \in B) = \int_A P(B)\,d\mu_{\mathbf{P}}(P) \qquad (1.95)$$

where $\beta'_j = \beta_j + 1$ and $\beta'_i = \beta_i$ for $i \ne j$.

33 If $\alpha(B) = \alpha(\mathcal{X})$, it is trivial to prove that $\int_B \mu_{\mathbf{P}|X_1}(A|x)\,d\mu_{X_1}(x) = \Pr(\mathbf{P} \in A,\, X_1 \in B)$, where $\mu_{\mathbf{P}|X_1}$ is the conditional distribution of $\mathbf{P}$ given $X_1$ to be defined later in this proof, and $\mu_{X_1}$ is the marginal distribution of $X_1$ already determined.

34 Recall the extended definition of the Dirichlet distribution in which $\alpha_i = 0$ means that the $i$th coordinate is 0 with probability 1.

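For each $x \in \mathcal{X}$, let $\alpha_x$ denote the measure defined by $\alpha_x(C) = \alpha(C) + I_C(x)$ for each $C \in \mathcal{B}$. Let $\mu_{\mathbf{P}|X_1}(\cdot|x)$ denote the $Dir(\alpha_x)$ measure on $\mathcal{P}$. It is easy to see that for $x \in B$, $\mu_{\mathbf{P}|X_1}(\cdot|x)$ says that the joint distribution of $\{\mathbf{P}(B_i^j), \text{ for } i = 1,\ldots,n \text{ and } j = 0, 1\}$ is $Dir_{2n}(\beta'_1,\ldots,\beta'_{2n})$, where the index that receives the extra 1 is the one whose set $B_i^0$ contains $x$. Hence, $\mu_{\mathbf{P}|X_1}(A|x)$ equals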

j=i j a b lUec L \Pi) ied \ i€C / tec 

It follows that 

= J Pp\x x (A\x)da(x) 



which is the same as the last expression in (1.95). That is,

$$\int_B \mu_{\mathbf{P}|X_1}(A|x)\,d\mu_{X_1}(x) = \Pr(\mathbf{P} \in A,\, X_1 \in B),$$

which is what it means to say that the conditional distribution of $\mathbf{P}$ given $X_1 = x$ is $Dir(\alpha_x)$. □

By combining Theorems 1.94 and 1.89, we see that the posterior distribution found in Theorem 1.94 is a regular conditional distribution.
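On a fixed finite partition, the update in Theorem 1.94 amounts to adding observation counts to the Dirichlet parameters. The following sketch is an illustration only: the function name `posterior_partition_params`, the interval partition of the real line, and the numerical values are hypothetical.

```python
from bisect import bisect_right

def posterior_partition_params(alpha_masses, cut_points, data):
    """Return beta(B_i) = alpha(B_i) + (number of observations in B_i).

    The partition consists of the intervals (-inf, c_1), [c_1, c_2), ..., [c_r, inf)
    determined by the sorted cut_points; alpha_masses[i] is alpha(B_i).
    """
    if len(alpha_masses) != len(cut_points) + 1:
        raise ValueError("need one alpha mass per interval")
    counts = [0] * len(alpha_masses)
    for x in data:
        counts[bisect_right(cut_points, x)] += 1  # index of the interval containing x
    return [a + c for a, c in zip(alpha_masses, counts)]

alpha_masses = [1.0, 2.0, 1.0]   # alpha(B_1), alpha(B_2), alpha(B_3): hypothetical values
cut_points = [0.0, 1.0]          # B_1 = (-inf, 0), B_2 = [0, 1), B_3 = [1, inf)
data = [-0.5, 0.3, 0.7, 2.0]

beta = posterior_partition_params(alpha_masses, cut_points, data)
total = sum(beta)
print(beta, [b / total for b in beta])  # posterior parameters and posterior means of P(B_i)
```

The posterior means $\beta(B_i)/\beta(\mathcal{X})$ are what Proposition-style predictive statements about a future observation use on this partition.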

Example 1.96. If $\alpha$ is a continuous measure (i.e., every singleton has 0 measure), then the posterior measure $\beta$ is a mixture of discrete and continuous parts. There is mass 1 at every observed data value, but no other values have positive measure.






Ferguson (1973) and Blackwell (1973) prove that there is a set of discrete distributions $\mathcal{P}_0 \subset \mathcal{P}$ such that the $Dir(\alpha)$ distribution assigns probability 1 to $\mathcal{P}_0$. Sethuraman (1994) proves an alternative theorem, which not only shows that the Dirichlet process is a probability on discrete distributions, but also gives an algorithm for approximately simulating a CDF with $Dir(\alpha)$ distribution. The result of Sethuraman (1994) is that the set of points on which the $Dir(\alpha)$ distribution concentrates its mass is an infinite IID sample $Y_1, Y_2, \ldots$ from the probability $\alpha/\alpha(\mathcal{X})$, and the probability assigned to $Y_n$ is $P_n$, where $P_1 = Q_1$, and for $n > 1$, $P_n = Q_n \prod_{i=1}^{n-1}(1 - Q_i)$, where the $Q_i$ are IID with $Beta(1, \alpha(\mathcal{X}))$ distribution. What we prove here is a very simple theorem of Krasker and Pratt (1986) which implies that the $Dir(\alpha)$ distribution assigns probability 1 to a set of discrete distributions.
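The construction attributed to Sethuraman (1994) can be sketched as a truncated simulation. This is an approximate illustration, not the author's algorithm: the function name `stick_breaking`, the truncation at a fixed number of atoms, and the uniform base probability used in the usage lines are assumptions made for the example.

```python
import random

def stick_breaking(total_mass, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw: atoms Y_n with weights P_n = Q_n * prod_{i<n} (1 - Q_i).

    total_mass plays the role of alpha(X); base_sampler(rng) draws from alpha / alpha(X).
    Returns (atoms, weights, leftover), where leftover is the mass not yet assigned.
    """
    atoms, weights = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        q = rng.betavariate(1.0, total_mass)  # Q_n ~ Beta(1, alpha(X))
        weights.append(q * remaining)         # P_n
        atoms.append(base_sampler(rng))       # Y_n
        remaining *= 1.0 - q
    return atoms, weights, remaining

rng = random.Random(2)
# Hypothetical base measure: alpha = 2 * Uniform(0, 1), so alpha(X) = 2.
atoms, weights, leftover = stick_breaking(2.0, lambda r: r.random(), 200, rng)
print(sum(weights), leftover)

def cdf(t):
    """Approximate random CDF at t from the truncated draw."""
    return sum(w for y, w in zip(atoms, weights) if y <= t)

print(cdf(0.5))
```

The leftover mass shrinks geometrically on average, so a moderate truncation level already gives a close approximation to a discrete distribution drawn from $Dir(\alpha)$.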

Theorem 1.97. Let $\{X_n\}_{n=1}^{\infty}$ be conditionally IID with distribution $P$ given $\mathbf{P} = P$. For $n > 1$, define

$$a_n = \Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1}).$$

If $\lim_{n\to\infty} a_n = 0$, then $\mathbf{P}$ is a discrete distribution with probability 1.

Proof. Define

$$B_\epsilon = \{P : \exists A \in \mathcal{B} \text{ such that } P(A) > \epsilon \text{ and } P(\{x\}) = 0 \text{ for all } x \in A\}.$$

It suffices to prove that $\Pr(B_\epsilon) = 0$ for all $\epsilon > 0$. The conditional probability, given $\mathbf{P} = P$ and $X_1,\ldots,X_{n-1}$, that $X_n$ is distinct from $X_1,\ldots,X_{n-1}$ is at least $\epsilon$ for all $P \in B_\epsilon$. It follows that

$$a_n = E\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1} \mid X_1,\ldots,X_{n-1},\mathbf{P})$$

$$\ge E\left[\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1} \mid X_1,\ldots,X_{n-1},\mathbf{P})\, I_{B_\epsilon}(\mathbf{P})\right]$$

$$\ge \epsilon \Pr(B_\epsilon).$$

Since $\lim_{n\to\infty} a_n = 0$, $\Pr(B_\epsilon) = 0$ for all $\epsilon > 0$ is necessary. □

For the $Dir(\alpha)$ distribution, it is easy to calculate $a_n = \alpha(\mathcal{X})/[\alpha(\mathcal{X}) + n - 1]$.

The posterior predictive distribution of a future observation is a weighted
average of the prior measure $\alpha/\alpha(\mathcal{X})$ and the empirical probability measure.

Proposition 1.98. Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID with
distribution $P$ given $\mathbf{P} = P$ and that $\mathbf{P}$ has Dir($\alpha$) distribution. The
posterior predictive distribution of a future $X_i$ given $X_1 = x_1, \ldots, X_n = x_n$ is
$\beta/[\alpha(\mathcal{X}) + n]$, where $\beta(C) = \alpha(C) + \sum_{i=1}^n I_C(x_i)$.
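Proposition 1.98 translates directly into a sequential sampler: after $i$ observations, the next one is a fresh draw from $\alpha/\alpha(\mathcal{X})$ with probability $\alpha(\mathcal{X})/[\alpha(\mathcal{X}) + i]$ and otherwise repeats one of the earlier values chosen uniformly. The sketch below is a minimal illustration; the names `prob_new` and `polya_urn_sample` and the uniform base measure on $[-1, 1]$ are assumptions made here, not part of the text.

```python
import random

def prob_new(total_mass, n):
    """a_n = alpha(X) / (alpha(X) + n - 1) for a continuous alpha."""
    return total_mass / (total_mass + n - 1)

def polya_urn_sample(total_mass, base_sampler, n, rng):
    """Draw X_1, ..., X_n by applying the predictive distribution repeatedly."""
    xs = []
    for i in range(n):
        # With i values observed, the predictive is beta / (alpha(X) + i).
        if rng.random() < total_mass / (total_mass + i):
            xs.append(base_sampler(rng))   # new value from alpha / alpha(X)
        else:
            xs.append(rng.choice(xs))      # repeat an earlier observation
    return xs

rng = random.Random(1)
sample = polya_urn_sample(2.0, lambda r: r.uniform(-1.0, 1.0), 10, rng)
print(len(set(sample)), "distinct values among", len(sample))
```

When $\alpha$ is continuous, a new draw is distinct from all earlier values with probability 1, so the chance that $X_n$ is new is exactly the $a_n$ computed by `prob_new`.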

The predictive joint distribution of several future observations can be ob- 
tained by applying Proposition 1.98 several times, each time after condi- 
tioning on one more random variable. This gives a straightforward way to 
generate a sample whose conditional distribution is P, which itself has a 
Dirichlet process distribution. The joint distribution can also be described 
as follows. 



1.6. Infinite-Dimensional Parameters 57 



Lemma 1.99.$^{35}$ Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID with
distribution $P$ given $\mathbf{P} = P$ and that $\mathbf{P}$ has Dir($\alpha$) distribution. Let $n > 0$. If
$p$ is a partition of $\{1, \ldots, n\}$, let $g(p)$ be the number of nonempty sets in
$p$, and let $k_1(p), \ldots, k_{g(p)}(p)$ be the numbers of elements of the $g(p)$ sets.
(Note that $\sum_{i=1}^{g(p)} k_i(p) = n$ for all $p$.) For each $x \in \mathcal{X}^n$, let $R(x)$ be the
partition of $\{1, \ldots, n\}$ which matches $x$. (That is, $x$ has $g(R(x))$ distinct
coordinates, and for each set $A$ in the partition $R(x)$, those coordinates of
$x$ whose subscripts are in $A$ are all equal to each other.) For each $x \in \mathcal{X}^n$,
define $Z(x) \in \mathcal{X}^{g(R(x))}$ to be the vector of distinct coordinates such that
$Z(x)_i$ is repeated $k_i(R(x))$ times in $x$. For each $p$ and each subset $B$ of $\mathcal{X}^n$,
define $B_p$ to be that subset of $\mathcal{X}^{g(p)}$ which consists of the set of distinct
coordinates of points in $B \cap R^{-1}(p)$. (That is, $B_p = Z[B \cap R^{-1}(p)]$.) Define
the measure $\nu$ on $\mathcal{X}^n$ by

$$\nu(B) = \sum_{\text{All } p} \alpha^{g(p)}(B_p).$$

The joint distribution of $X_1, \ldots, X_n$ has the following density with respect
to the measure $\nu$:

$$f_X(x) = \prod_{i=1}^{n} (\alpha(\mathcal{X}) + i - 1)^{-1} \sum_{\text{All } p} I_{R^{-1}(p)}(x) \prod_{i=1}^{g(p)} \prod_{j=2}^{k_i(p)} \left(\alpha(\{Z(x)_i\}) + j - 1\right),$$

where an empty product is taken to be 1.

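Proof. Let $X = (X_1, \ldots, X_n)$. We need to show that $\Pr(X \in B) =
\int_B f_X(x)\,d\nu(x)$ for all $B \subseteq \mathcal{X}^n$. Let $B \subseteq \mathcal{X}^n$. We will show that

$$\Pr(X \in B \cap R^{-1}(p)) = \int_{B \cap R^{-1}(p)} f_X(x)\,d\nu(x)$$

for every partition $p$, and the result will then follow by adding up finitely
many terms. It is easy to see that $\nu(C) = \alpha^{g(p)}(C_p)$ for each $p$ and each
subset $C$ of $R^{-1}(p)$, and that $f_X$ is a function of $Z$. It follows that

$$\begin{aligned}
\int_{B \cap R^{-1}(p)} f_X(x)\,d\nu(x)
&= \int_{B_p} f_X(Z^{-1}(z))\,d\alpha^{g(p)}(z) \qquad (1.100) \\
&= \prod_{i=1}^{n} (\alpha(\mathcal{X}) + i - 1)^{-1} \int_{B_p} \prod_{i=1}^{g(p)} \prod_{j=2}^{k_i(p)} \left(\alpha(\{z_i\}) + j - 1\right) d\alpha^{g(p)}(z).
\end{aligned}$$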

$^{35}$ This lemma is used in Examples 1.102 and 1.103.






Fix $p$ and write $B_p = B_1 \cup B_2$, where every coordinate of every point in $B_1$
has 0 $\alpha$ measure. The points in $B_2$ have at least one coordinate with
positive $\alpha$ measure. There are at most countably many values of $y$ such that
$\alpha(\{y\}) > 0$, say they are $y_1, y_2, \ldots$. For $k = 1, \ldots, g(p)$, $i_1, \ldots, i_k$ distinct
elements of $\{1, \ldots, g(p)\}$, and $\ell_1, \ldots, \ell_k$ distinct integers, let
$B_{2; i_1, \ldots, i_k; \ell_1, \ldots, \ell_k}$ be the subset of $B_2$ in which $z_{i_t} = y_{\ell_t}$ for
$t = 1, \ldots, k$ and all other coordinates of $z$ are distinct points with 0 $\alpha$
measure. These sets are disjoint, and their union is $B_2$. On each of these sets,
and on $B_1$, the integrand in the far right-hand side of (1.100) is constant.
Hence, the far right-hand side of (1.100) can be written as

$$\prod_{i=1}^{n} (\alpha(\mathcal{X}) + i - 1)^{-1} \Bigl\{ \alpha^{g(p)}(B_1) \prod_{i=1}^{g(p)} \prod_{j=2}^{k_i(p)} (j - 1) + \cdots \Bigr\}. \qquad (1.101)$$



Now, we will show that (1.101) is the probability that $X \in B \cap R^{-1}(p)$. Let
$j_1, \ldots, j_{g(p)}$ be $g(p)$ coordinates that are distinct for all $x$ in $R^{-1}(p)$. Let

$$W = Z(X) = (X_{j_1}, \ldots, X_{j_{g(p)}}) \in B_p.$$

The first term in (1.101) is the probability that $W \in B_1$ and that the other
coordinates of $X$ all match the coordinates they need to match in order for
$R(X) = p$. Also, each of the summands in the second term in (1.101) is the
probability that $W \in B_{2; i_1, \ldots, i_k; \ell_1, \ldots, \ell_k}$ and that the other coordinates
of $X$ all match the coordinates they need to match. The sum is then the
probability that $X \in B \cap R^{-1}(p)$. □

Example 1.102. As a simple example of Lemma 1.99, suppose that $\mathcal{X} = \mathbb{R}$ and
$\alpha$ is some finite continuous (no point masses) measure. The measure $\nu$ is then
the sum of the various $k$-dimensional product measures of $\alpha$ for $k = 1, \ldots, n$ over
the sets where there are exactly $k$ distinct coordinates. For example, if $n = 3$,
then the partitions are

$$p_1 = \{\{1\}, \{2\}, \{3\}\}, \quad p_2 = \{\{1,2\}, \{3\}\}, \quad p_3 = \{\{1\}, \{2,3\}\}, \quad p_4 = \{\{1,3\}, \{2\}\}, \quad p_5 = \{\{1,2,3\}\}.$$

So, $g(p_1) = 3$, and $k_i(p_1) = 1$, while $g(p_2) = g(p_3) = g(p_4) = 2$, and so on. Also,

$$\begin{aligned}
R^{-1}(p_1) &= \{(x_1, x_2, x_3) : x_1 \ne x_2, x_1 \ne x_3, x_2 \ne x_3\}, \\
R^{-1}(p_2) &= \{(x_1, x_2, x_3) : x_1 = x_2, x_1 \ne x_3\}, \\
R^{-1}(p_3) &= \{(x_1, x_2, x_3) : x_2 = x_3, x_1 \ne x_3\}, \\
R^{-1}(p_4) &= \{(x_1, x_2, x_3) : x_1 = x_3, x_1 \ne x_2\}, \\
R^{-1}(p_5) &= \{(x_1, x_2, x_3) : x_1 = x_2 = x_3\}.
\end{aligned}$$






The measure $\nu$ is $\alpha^3$ on $R^{-1}(p_1)$ plus $\alpha^2$ on $R^{-1}(p_2) \cup R^{-1}(p_3) \cup R^{-1}(p_4)$ plus $\alpha$
on $R^{-1}(p_5)$. Also,

$$f_X(x) = \frac{1}{\alpha(\mathcal{X})[\alpha(\mathcal{X}) + 1][\alpha(\mathcal{X}) + 2]} \times \begin{cases} 2 & \text{if } x \in R^{-1}(p_5), \\ 1 & \text{otherwise.} \end{cases}$$

To calculate the probability that $X$ is in the unit cube $B$, say, we must add up
five integrals, one for each partition.

$$\Pr(0 < X_i < 1, \text{ for } i = 1, 2, 3) = \frac{1}{\alpha(\mathcal{X})[\alpha(\mathcal{X}) + 1][\alpha(\mathcal{X}) + 2]} \left\{ \alpha^3(B_{p_1}) + \alpha^2(B_{p_2}) + \alpha^2(B_{p_3}) + \alpha^2(B_{p_4}) + 2\alpha(B_{p_5}) \right\}.$$

For concreteness, suppose that $\mathcal{X}$ is $[-1, 1]$ and $\alpha$ is Lebesgue measure. Then
$\alpha(\mathcal{X}) = 2$ and $\alpha^{g(p)}(B_p) = 1$ for all $p$. So, $\Pr(X \in B) = 0.25$, substantially above
the product probability $\alpha^3(B)/\alpha(\mathcal{X})^3 = 0.125$. The negative unit cube (all $X_i$
between $-1$ and 0) also has probability 0.25, while the six other subcubes each
have probability 1/12.
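The numbers in Example 1.102 can be reproduced by summing over set partitions exactly as the density in Lemma 1.99 prescribes. The sketch below assumes, as in the example, that $\alpha$ is Lebesgue measure on an interval and that the event is a box contained in that interval; the helper names `set_partitions` and `box_probability` are hypothetical.

```python
from fractions import Fraction
from math import factorial

def set_partitions(items):
    """Yield every partition of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):                 # join an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part                     # or start a new block

def box_probability(intervals, space=(-1, 1)):
    """Pr(X_i in intervals[i] for all i) when alpha is Lebesgue measure on `space`."""
    n = len(intervals)
    total_mass = Fraction(space[1] - space[0])
    norm = Fraction(1)
    for i in range(n):
        norm *= total_mass + i                     # prod_i (alpha(X) + i - 1)
    total = Fraction(0)
    for part in set_partitions(list(range(n))):
        term = Fraction(1)
        for block in part:
            lo = max(intervals[i][0] for i in block)
            hi = min(intervals[i][1] for i in block)
            term *= max(Fraction(0), Fraction(hi - lo))  # alpha of the shared interval
            term *= factorial(len(block) - 1)            # prod_{j=2}^{k} (j - 1)
        total += term
    return total / norm

print(box_probability([(0, 1)] * 3))               # unit cube
print(box_probability([(0, 1), (0, 1), (-1, 0)]))  # one of the six mixed subcubes
```

Coordinates that share a block must take a common value, so only the intersection of their intervals contributes; that is why the mixed subcubes pick up fewer partitions than the two cubes on the diagonal.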

Straightforward applications of Dirichlet process priors to one-sample 
problems are singularly uninteresting, except in cases in which one might 
use the bootstrap technique (see Section 5.3). There are, however, ways to 
make use of Dirichlet process priors in less straightforward fashion. 

Example 1.103. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID with distribution
$P$ given $\mathbf{P} = P$, and we are sure that there exists a finite number $\theta$ such that
$P((-\infty, \theta]) = 1$. Unless $\theta$ is known, it is not possible that $\mathbf{P}$ has Dirichlet process
distribution. If we let $\Theta$ be the unknown least upper bound on the support of the
$X_i$'s, we can suppose that $\mathbf{P}$ given $\Theta = \theta$ has Dir($\alpha_\theta$) distribution, where $\alpha_\theta$ is
a finite measure on $(-\infty, \theta]$. Let $\Theta$ have prior density $f_\Theta$. Let $c_\theta = \alpha_\theta((-\infty, \theta])$.
Suppose also that $\alpha_\theta$ is absolutely continuous with respect to Lebesgue measure
with Radon-Nikodym derivative $a_\theta$. Using Lemma 1.99, the likelihood function
for $\Theta$ after observing $X_1, \ldots, X_n$ to obtain $g$ distinct values $y_1, \ldots, y_g$ with $k_i$
repetitions of $y_i$ is

$$L(\theta) = \frac{\prod_{i=1}^{g} \left( a_\theta(y_i) \prod_{j=2}^{k_i} (j - 1) \right)}{\prod_{i=1}^{n} (c_\theta + i - 1)}\, I_{[\max\{y_1, \ldots, y_g\}, \infty)}(\theta).$$

Hence, the posterior distribution for $\Theta$ can be found. Conditional on $\Theta = \theta$,
the posterior for $\mathbf{P}$ is a Dirichlet process with measure $\beta_\theta$ equal to $\alpha_\theta$ plus
point masses at the observed values. The marginal posterior of $\mathbf{P}$ is a mixture of
Dirichlet processes. Antoniak (1974) studied mixtures of Dirichlet processes and
describes many of their properties.

Example 1.103 can be somewhat deceiving if one is really trying to model
data from a continuous distribution. If $g = n$ in that example, then all of
the $k_i = 1$. If $c_\theta$ is the same for all $\theta$, then the likelihood function is the
same as one would obtain by modeling the data as conditionally IID given






$\Theta = \theta$ with density $a_\theta(\cdot)/c_\theta$. This is probably not the effect one thought
one was achieving by using a Dirichlet process. That is, there is nothing 
the least bit nonparametric about the analysis one ends up performing in 
this situation. In fact, this phenomenon is quite general. 

Lemma 1.104.$^{36}$ Suppose that person 1 believes that $\{X_n\}_{n=1}^\infty$ are IID
with a continuous distribution. For each $\theta \in \Omega$, let $\alpha_\theta$ be a continuous
finite measure with $\alpha_\theta(\mathcal{X}) = c$ for all $\theta$. Suppose that person 2 models the
data as conditionally IID given $\mathbf{P} = P$ and $\Theta = \theta$ with distribution $P$ and
that $\mathbf{P}$ given $\Theta = \theta$ has Dir($\alpha_\theta$) distribution. Suppose that person 3 models
the data as conditionally IID given $\Theta = \theta$ with distribution $\alpha_\theta/c$. Assume
that $\alpha_\theta \ll \eta$ for all $\theta$. Suppose that person 2 and person 3 use the same
prior distribution for $\Theta$. Then person 1 believes that, with probability 1, for
every $n$, person 2 and person 3 will calculate exactly the same posterior
distributions for $\Theta$ given $X_1, \ldots, X_n$.

Proof. First, note that the density $f_X$ in Lemma 1.99 is constant in $\theta$ for
every data set that has no observed values at points where $\alpha_\theta$ puts positive
mass. Such a data set will occur with probability 1 according to person 1.
Let $a_\theta = d\alpha_\theta/d\eta$. With probability 1 (according to person 1) person 2 will
then have likelihood function proportional to $\prod_{i=1}^n a_\theta(x_i)$. This is the same
as the likelihood function that person 3 will have. Hence, persons 2 and 3
will calculate the same posterior. □
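A small numerical illustration of Lemma 1.104, under assumptions chosen here for convenience: $\alpha_\theta$ is $c$ times the $N(\theta, 1)$ distribution, the data are distinct, and both persons use a flat prior on a finite grid of $\theta$ values. The function names are hypothetical. Person 2's likelihood is the Lemma 1.99 density with all $k_i = 1$; person 3's is the ordinary product of normal densities.

```python
import math

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def dp_likelihood(theta, data, c):
    """Person 2: prod_i a_theta(x_i) / prod_i (c + i - 1), distinct data assumed."""
    num = math.prod(c * normal_pdf(x, theta) for x in data)
    den = math.prod(c + i for i in range(len(data)))
    return num / den

def parametric_likelihood(theta, data):
    """Person 3: data IID with density a_theta / c, here N(theta, 1)."""
    return math.prod(normal_pdf(x, theta) for x in data)

def grid_posterior(likelihood, grid):
    """Posterior on a finite grid under a flat prior (an assumption of this sketch)."""
    weights = [likelihood(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

data = [0.3, -1.2, 2.5, 0.9]
grid = [i / 10.0 for i in range(-30, 31)]
post2 = grid_posterior(lambda t: dp_likelihood(t, data, 3.0), grid)
post3 = grid_posterior(lambda t: parametric_likelihood(t, data), grid)
print(max(abs(p - q) for p, q in zip(post2, post3)))  # differences at rounding level
```

The two likelihoods differ only by a factor that does not involve $\theta$, so it cancels when the posterior is normalized.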

Example 1.105. Since the Dirichlet process assigns probability 1 to discrete
CDFs, it may not be considered suitable for cases in which one really wants a
continuous CDF. One possibility is to model the observable data $\{X_n\}_{n=1}^\infty$ as
$X_i = Y_i + Z_i$ where $\{Y_n\}_{n=1}^\infty$ are conditionally IID with CDF $G$ given $\mathbf{G} = G$,
where $\mathbf{G}$ has Dir($\alpha$) distribution, and $\{Z_n\}_{n=1}^\infty$ are independent of $\{Y_n\}_{n=1}^\infty$
and of $\mathbf{G}$ and of each other with a distribution having density $f$. The posterior
distribution of $\mathbf{G}$ is not easy to obtain in this case, but a method that can be used
to approximate it will be given in Section 8.5. Escobar (1988) gives an algorithm
for implementing this method.

1.6.2 Tailfree Processes+

In this section, we introduce a second class of distributions over an infinite- 
dimensional space of probabilities. This time it will be possible for the 
random probability measure P not to be discrete. 

Definition 1.106. Let $(\mathcal{X}, \mathcal{B})$ be a Borel space. For each integer $n > 0$, let
$\pi_n$ be a countable partition of $\mathcal{X}$ whose elements are in $\mathcal{B}$. Suppose that



36 This lemma is used to show why it may not be sensible to use a Dirichlet pro- 
cess for the prior if there will also be an additional finite-dimensional parameter 
of interest. 

+This section contains results that rely on the theory of martingales. It may 
be skipped without interrupting the flow of ideas. 






$\pi_{n+1}$ is a refinement of $\pi_n$ for each $n$. Let the trivial partition be $\pi_0 = \{\mathcal{X}\}$.
Let $\mathcal{C} = \bigcup_{n=1}^\infty \pi_n$. Suppose that $\mathcal{B}$ is the smallest $\sigma$-field containing $\mathcal{C}$. For
each $n$, let $\{V_{n;B} : B \in \pi_n\}$ be a collection of nonnegative random variables
such that the collections are mutually independent. For each $n \ge 1$ and each
$B_1 \supseteq \cdots \supseteq B_n$ with $B_i \in \pi_i$, define

$$P(B_n) = \prod_{i=1}^{n} V_{i;B_i}.$$

Then we say that the stochastic process $\mathbf{P} = \{P(B) : B \in \mathcal{C}\}$ is tailfree
with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$. For each $n \ge 1$ and
$B \in \pi_n$, define $ps(B) = C$, where $C \in \pi_{n-1}$ and $B \subseteq C$. Call this the most
recent superset of $B$. For each $x \in \mathcal{X}$ and each $n$, define

$$\begin{aligned}
C_n(x) &= \text{that } B \in \pi_n \text{ such that } x \in B, \\
V_n(x) &= V_{n;C_n(x)}.
\end{aligned}$$

Note that the random variables in $\{V_{n;B} : B \in \pi_n\}$ do not have to be
independent of each other, but they must be independent of those in $\{V_{m;B} :
B \in \pi_m\}$ for $m \ne n$. The class of tailfree processes was introduced by
Freedman (1963) and Fabius (1964). Also, see Ferguson (1974).
A necessary condition for $\mathbf{P}$ to be a random probability measure is that

$$\sum_{\text{All } C \text{ such that } ps(C) = B} V_{n;C} = 1. \qquad (1.107)$$

Another necessary condition is that if $B_n$ is a union of elements of $\pi_n$ for
each $n$, $B_1 \supseteq B_2 \supseteq \cdots$, and $\bigcap_{n=1}^\infty B_n = \emptyset$, then

II 51 ^) =0 ' a - s - (1.108) 

n=l \{B:BE7r n ,BCB n } y 

These two conditions are also sufficient. (See Problem 51 on page 81.) For 
the remainder of this book, when we refer to a tailfree process, we will 
assume that it is a random probability measure. 

As an example we can show that Dirichlet processes are tailfree with 
respect to every sequence of partitions. 

Example 1.109. Let $\mathbf{P}$ have Dir($\alpha$) distribution. Let $\{\pi_n\}_{n=1}^\infty$ be a sequence
of countable partitions such that $\pi_{n+1}$ is a refinement of $\pi_n$ for all $n$. We can
prove that $\mathbf{P}$ is tailfree with respect to $\{\pi_n\}_{n=1}^\infty$. For each $n$ and each $B \in \pi_n$,
set $V_{n;B} = P(B)/P(ps(B))$. The fact that the collections $\{V_{n;B} : B \in \pi_n\}$ are
independent follows from a well-known fact about Dirichlet distributions. (See
Problem 52 on page 81.)

As another example, we can place a tailfree process distribution on the 
class of distributions symmetric around a point. 






Example 1.110. Let $\mathcal{X} = \mathbb{R}$, and let $\pi_1 = \{(-\infty, 0), \{0\}, (0, \infty)\}$. Let $\{\pi_n^-\}_{n=2}^\infty$
be a sequence of nested partitions of $(-\infty, 0)$. For each $n > 1$, let $\pi_n^+$ be the
partition of $(0, \infty)$ formed by the negatives of the sets in $\pi_n^-$. Let $\pi_n = \pi_n^- \cup \pi_n^+ \cup
\{\{0\}\}$. So long as $V_{n;B} = V_{n;C}$ whenever $B = -C$, $\mathbf{P}$ will be symmetric around 0.

When $X \sim P$ given $\mathbf{P} = P$ and $\mathbf{P}$ is a tailfree process, the predictive
distribution of $X$ can be computed.

Proposition 1.111. Let $\mathbf{P}$ be a random probability measure that is tailfree
with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$ and $X \sim P$ given $\mathbf{P} = P$. Let
$A \in \pi_m$, and let $A = \bigcap_{i=1}^m B_i$, where $B_i \in \pi_i$ for each $i \le m$. Then the
predictive probability that $X \in A$ is

$$\mu_X(A) = E P(A) = \prod_{i=1}^{m} E(V_{i;B_i}). \qquad (1.112)$$

It is sometimes possible to find a density for the predictive distribution of 
X with respect to some measure v of interest (like Lebesgue measure). 

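Lemma 1.113.$^{37}$ Let $(\mathcal{X}, \mathcal{B}, \nu)$ be a $\sigma$-finite measure space. Assume that
$\mathbf{P}$ is tailfree with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$. Assume
that each element of each $\pi_n$ has positive $\nu$ measure. For each $x \in \mathcal{X}$, let

$$f_n(x) = \frac{1}{\nu(C_n(x))} \prod_{i=1}^{n} E(V_i(x)).$$

If $\lim_{n\to\infty} f_n(x) = f(x)$, a.e. $[\nu]$, and $\int f(x)\,d\nu(x) = 1$, then $f = d\mu_X/d\nu$.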

Proof. We need to prove that for each $B \in \mathcal{C}$, $\mu_X(B) = \int_B f(x)\,d\nu(x)$.
The extensions to the smallest field containing $\mathcal{C}$ and the smallest $\sigma$-field
containing $\mathcal{C}$ are straightforward. Let $B \in \pi_n$, and let $B \subseteq B_i \in \pi_i$ for
$i = 1, \ldots, n$. Then $V_i(x) = V_{i;B_i}$ for all $x \in B$ and $i = 1, \ldots, n$. By (1.112),
we have, for each $x_0 \in B$,

$$\mu_X(B) = \prod_{i=1}^{n} E(V_i(x_0)) = \nu(C_n(x_0)) f_n(x_0) = \int_B f_n(x)\,d\nu(x),$$

since $f_n(x) = f_n(x_0)$ for all $x \in B$ and $B = C_n(x_0)$. For $k > n$, write
$B = \bigcup_{a \in \aleph} D_a$ as the partition of $B$ by elements of $\pi_k$. Since $f_k$ is constant
on each $D_a$ (call the value $f_k(x_a)$ for $x_a \in D_a$), we can write

$$\int_B f_k(x)\,d\nu(x) = \sum_{a \in \aleph} \int_{D_a} f_k(x)\,d\nu(x) = \sum_{a \in \aleph} \prod_{i=1}^{k} E(V_i(x_a)) = \sum_{a \in \aleph} \mu_X(D_a) = \mu_X(B).$$



$^{37}$ This lemma is used in Example 1.114.






Hence, we have $\int_B f_k(x)\,d\nu(x) = \mu_X(B)$ for all $k > n$. So,

$$\lim_{k\to\infty} \int_B f_k(x)\,d\nu(x) = \mu_X(B).$$

It follows from Scheffe's theorem B.79 that

$$\lim_{k\to\infty} \int_B f_k(x)\,d\nu(x) = \int_B f(x)\,d\nu(x).$$

□



Example 1.114. Suppose that $\nu$ is a finite measure and, for each $n$ and $B$,
$E(V_{n;B}) = \nu(B)/\nu(ps(B))$. In the notation of Lemma 1.113, $f_n(x) = 1/\nu(\mathcal{X})$ for
all $n$ and $x$. In this case, $\mu_X = \nu/\nu(\mathcal{X})$, and the density is constant. In fact, this
gives a convenient way to force a tailfree process to have a desired predictive
distribution for $X$.
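The claim that $f_n$ is constant is a telescoping product. As a short worked step (notation $B_i = C_i(x)$ introduced here for brevity, with $B_0 = \mathcal{X}$ and $ps(B_i) = B_{i-1}$):

```latex
\prod_{i=1}^{n} E(V_i(x))
  = \prod_{i=1}^{n} \frac{\nu(B_i)}{\nu(B_{i-1})}
  = \frac{\nu(C_n(x))}{\nu(\mathcal{X})},
\qquad\text{so}\qquad
f_n(x) = \frac{1}{\nu(C_n(x))}\cdot\frac{\nu(C_n(x))}{\nu(\mathcal{X})}
       = \frac{1}{\nu(\mathcal{X})}.
```

By (1.112) the same product gives $\mu_X(A) = \nu(A)/\nu(\mathcal{X})$ for every $A$ in some $\pi_m$, which is the stated predictive distribution.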

Tailfree processes are conjugate in the sense that the posterior is tailfree 
if the prior is tailfree. 

Theorem 1.115. Let $\mathbf{P}$ be a random probability measure that is tailfree
with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$ and $X \sim P$ given $\mathbf{P} = P$.
Then $\mathbf{P}$ given $X$ is tailfree with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$.

Proof. Fix $k$ and $n_1 < \cdots < n_k$. Let $V^i$ for $i = 1, \ldots, k$, be a finite (say,
with size $s_i$) collection of the elements of $\pi_{n_i}$, and let $F_{V^i}$ denote their joint
CDF. Let $F_{V^i|X}(\cdot|x)$ denote the conditional CDF of $V^i$ given $X = x$. Let
$A \in \mathcal{B}$, and let $C_i \subseteq \mathbb{R}^{s_i}$ for $i = 1, \ldots, k$. We must show that

$$\Pr(X \in A, V^i \in C_i, \text{ for } i = 1, \ldots, k) = \int_A \prod_{i=1}^{k} F_{V^i|X}(C_i|x)\,d\mu_X(x), \qquad (1.116)$$

where $\mu_X$ is given by (1.112). If we can prove (1.116) for all $A \in \mathcal{C}$, it will
be true for all $A \in \mathcal{B}$ by Theorem A.26.

First, we find $F_{V^i|X}$. By definition, for $C \subseteq \mathbb{R}^{s_i}$, $F_{V^i|X}(C|\cdot)$ is any
measurable function $h$ such that

$$\int_A h(x)\,d\mu_X(x) = \Pr(X \in A, V^i \in C),$$

for all $A \in \mathcal{B}$. Once again, the equation need only hold for all $A \in \mathcal{C}$. We
propose the following function:

$$h(x) = \frac{E\left(I_C(V^i)\, V_{n_i}(x)\right)}{E(V_{n_i}(x))}, \qquad (1.117)$$

where $V_m(x)$ is defined in Definition 1.106 to be the random variable
corresponding to that element of partition $\pi_m$ which contains $x$. Note that $h$ is
constant on each element of $\pi_{n_i}$. We find it convenient to let $h(B)$ stand for


that constant value if $x \in B \in \pi_{n_i}$. Let $A = B_n \in \pi_n$, let $m = \max\{n_i, n\}$,
and define, for $j = 1, \ldots, m$,



Vj-Bj if j < n, j ^ n u ACBjE TTj, 

1 if j > n, j ^ rii, 

(Va i; B n . , V*) if j = rii < n, A C B Ui G 7r ni , 

I (£*, V*) if j = m>n, 

!IR if j < n, j ^ n*, A C G 7ij, 

{1} if j > n, j ^ n 4 , 
ExC ifj = n<, 

where $B^*$ is the union of all elements of $\pi_{n_i}$ which are subsets of $A$. If
$j = n_i \le n$, it is possible that the first coordinate of $U_j$ is repeated later in
the vector. With these definitions, it is clear that

Pr(XG A,V l eC) 

= Pr(X g A, Uj eDj, j = l,...,ra) 

= / / v>m,i TT v>jdF Uvn (um) • • • dFutiui) 
= EiUnMV^UEiUj) 

E(V nj>Bni / c (^)) n j#ni E(V >iBj ) if m < n, 

£ Bsuch E(/c(V i )V nii B)n;iT 1 E(K j . Bj( fl ) ) ifn,>n. 

that fiCjl 

If we let h(x) be as in (1.117), then 

/ h(x)dn x {x)= / h(x)dft x (x). (1.H8) 

If $n_i \le n$, there is only one term in the sum in (1.118). It is the term with
$B = B_{n_i}$ such that $A \subseteq B_{n_i}$, and the integral equals

/>(£„>* 04) = — E(Vn b ) 11 E( ^ ;B ^ 

which is what we needed to show. Similarly, if $n_i > n$, the only terms in
the sum in (1.118) which appear are those for which $B \subseteq A$, and $A$ is the
union of these sets. The sum becomes

]T h(B)px{B) (1.119) 

B such that £ C A 






B such that BCA J =1 

where $B_j(B) \in \pi_j$ and $B \subseteq B_j(B)$ for $j = 1, \ldots, n_i$. The right-hand side
of (1.119) can be written as

Ui-l 

£ E(W)V n4iB ) J] E(V j;B , ( B)), 

B such that BCA i =1 

and it follows that $h$ is a version of $F_{V^i|X}(C|\cdot)$.

To prove (1.116), let $A \in \pi_n$. If $n < n_k$, break $A$ into its intersections
with all elements of $\pi_{n_k}$ and then add up the sides of (1.116) over the
disjoint intersections to see that (1.116) holds. Hence we need only prove
(1.116) if $n \ge n_k$. As before, let $B_j$ be the element of partition $\pi_j$ such that
$A \subseteq B_j$ for $j = 1, \ldots, n$. Since $n \ge n_k$, $A$ is a subset of one and only one
of the partition elements for each $\pi_{n_i}$ partition. Define, for $j = 1, \ldots, n$,

w. - / 5 i if i £{ni,...,rife}, 

I M;B„V«) if * = 



n. = / [M] ifi^{ni,...,rife}, 
J \ RxQ if j = m. 



In what follows, $W_{j,1}$ will be the first coordinate of $W_j$. It is easy to see
that

Pr(X e A,V* e d, fort = l,...,fc) 

= Pr(lG E^., for j = l,...,n) 

= L * ' * / II w i.id^Wn W • • • dF Wl (w x ) 

n 

II m^fiEiv^jc^)) 

jg{ni,...,n fc } i=l 

j=l i=l MVniiBnJ 

- fc 

= / n F ^ix(Qix)d MX (x). 



It is not difficult to check that the posterior distribution found in Theo- 
rem 1.115 still satisfies the two conditions (1.107) and (1.108). Hence, it is 
a regular conditional distribution. 






Example 1.120. Suppose that $\mathcal{X} = \mathbb{R}$ and each set $B$ is an interval. In the
proof of Theorem 1.115, suppose that one of the $V^i$ is just the single coordinate
$V_{n_i}(x)$. Also, suppose that this $V_{n_i}(x)$ has a prior density $f(v)$ with respect
to some measure (like Lebesgue measure). Then (1.117) says that the posterior
density of $V_{n_i}(x)$ (aside from a normalizing constant) is $v f(v)$. If $V'$ is another
random variable from partition $n_i$, and if $(V', V_{n_i}(x))$ have joint prior density
$g(v', v)$, then their joint posterior is proportional to $v g(v', v)$.
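As a concrete check of the "$v f(v)$" rule, suppose (an assumption made only for this sketch) that the prior density $f$ is Beta$(a, b)$. Multiplying by $v$ and renormalizing by the prior mean should then give the Beta$(a + 1, b)$ density. The function names below are hypothetical.

```python
import math

def beta_pdf(v, a, b):
    """Density of the Beta(a, b) distribution at v in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(v) + (b - 1) * math.log(1 - v))

def tilted_posterior_pdf(v, a, b):
    """v * f(v) / E(V) for a Beta(a, b) prior density f, as in (1.117)."""
    prior_mean = a / (a + b)
    return v * beta_pdf(v, a, b) / prior_mean

for v in (0.1, 0.5, 0.9):
    print(v, tilted_posterior_pdf(v, 2.5, 4.0), beta_pdf(v, 3.5, 4.0))
```

The two printed columns agree up to rounding, which is the one-dimensional version of the Dirichlet update used for Polya trees later in this section.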

Tailfree processes are more general than Dirichlet processes. In fact, 
they can be continuous or even absolutely continuous with respect to non- 
discrete measures with probability one. Mauldin, Sudderth, and Williams 
(1992) prove a result giving conditions under which P is continuous with 
probability 1. (See Problem 57 on page 81.) Kraft (1964) and Metivier 
(1971) proved theorems that said that if $\mathcal{X} = (0, 1]$ and $\pi_n$ has $2^n$ sets,
and if a few other conditions hold, then P has a density with respect to 
Lebesgue measure with probability 1. We generalize these latter theorems 
to arbitrary tailfree processes. 

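Theorem 1.121. Let $(\mathcal{X}, \mathcal{B}, \nu)$ be a probability space. Assume that $\mathbf{P}$ is
tailfree with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge 1, B \in \pi_n\})$. Assume that
each element of each $\pi_n$ has positive $\nu$ measure and that, for all $n$ and
$B \in \pi_n$, $E(V_{n;B}) = \nu(B)/\nu(ps(B))$. For each $n$ and each $x \in \mathcal{X}$, define

$$f_n(x) = \prod_{i=1}^{n} \frac{V_i(x)}{E(V_i(x))}.$$

If $\sup_n \int_{\mathcal{X}} E[f_n^2(x)]\,d\nu(x) < \infty$, then, with probability 1,

1. $\lim_{n\to\infty} f_n(x) = f(x)$ exists and is finite a.e. $[\nu]$, and

2. $P(A) = \int_A f(x)\,d\nu(x)$, for all $A \in \mathcal{B}$.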

Before proving Theorem 1.121, we should say a little about its statement.
The condition that $E(V_{n;B}) = \nu(B)/\nu(ps(B))$ is equivalent to saying that
$\nu = \mu_X$ in (1.112). (See Problem 50 on page 81.) The formula for $f_n(x)$
is nothing more than the formula for $P(C_n(x))/\nu(C_n(x))$. Hence, $f_n$ is the
density (with respect to $\nu$) corresponding to an approximation to $\mathbf{P}$ which
ignores the fine detail on sets in partitions past $n$. In other words, $f_n$ is
constant on all sets in $\pi_n$ and is a density for $\mathbf{P}$ restricted to the smallest
$\sigma$-field containing all sets in partitions up to $n$. Since $E(f_n(x)) = 1$ for all
$n$ and $x$, the theorem says that if there is not too much variation in the
approximate densities, then the approximate densities converge to a density
for $\mathbf{P}$.

Proof of Theorem 1.121.$^{38}$ Consider the probability space $(S \times \mathcal{X}, \mathcal{A} \otimes
\mathcal{B}, \mu \times \nu)$. For part (1), let $\mathcal{F}_n$ be the product $\sigma$-field of the $\sigma$-field generated



38 The proof makes use of martingale theory. 






by $\{V_{i;B} : B \in \pi_i, i \le n\}$ with the $\sigma$-field $\mathcal{B}$. Then, as a function from
$S \times \mathcal{X}$ to $\mathbb{R}$, $f_n$ is measurable with respect to $\mathcal{F}_n$. Also, since the $V_i(x)$ are
independent for fixed $x$,

$$E(f_{n+1}(x) \mid \mathcal{F}_n) = f_n(x).$$

So, the stochastic process $\{(f_n(x), \mathcal{F}_n)\}_{n=1}^\infty$ is a martingale. Since

$$\int f_n(x)\,d\mu \times \nu(s, x) = 1 \text{ for all } n,$$

the martingale convergence theorem B.117 implies that $\lim_{n\to\infty} f_n = f$
exists and is finite a.s. $[\mu \times \nu]$, which means (in terms of the original
probability space) that with probability 1, $\lim_{n\to\infty} f_n(x) = f(x)$ exists and is
finite a.e. $[\nu]$.

For part (2), we first show that the sequence $\{f_n\}_{n=1}^\infty$ is uniformly
integrable with respect to $\mu \times \nu$:

$$\begin{aligned}
\int_{\{(s,x) : f_n(x) > m\}} f_n(x)\,d\mu \times \nu(s, x)
&= \int_{\mathcal{X}} \int_S f_n(x)\, I_{(m,\infty)}(f_n(x))\,d\mu(s)\,d\nu(x) \\
&\le \int_{\mathcal{X}} \int_S \frac{f_n^2(x)}{m}\,d\mu(s)\,d\nu(x) \\
&= \frac{1}{m} \int_{\mathcal{X}} E[f_n^2(x)]\,d\nu(x), \qquad (1.122)
\end{aligned}$$

where the first equation follows from Tonelli's theorem A.69, and the
inequality follows since $I_{(m,\infty)}(f_n(x)) \le f_n(x)/m$. By assumption, the
supremum over $n$ of the last expression in (1.122) is a finite number divided by
$m$, which goes to 0 with $m$, so the sequence is uniformly integrable.

Next, we prove that $f$ is a density with respect to $\nu$ with probability 1.
By Theorem A.60, we have

$$\lim_{n\to\infty} \int f_n(x)\,d\mu \times \nu(s, x) = \int f(x)\,d\mu \times \nu(s, x). \qquad (1.123)$$

Since the left-hand side is 1, we have that $f$ is integrable. It follows that
if $A \in \mathcal{A} \otimes \mathcal{B}$, then $I_A(s, x)|f_n(x) - f(x)|$ is uniformly integrable. Let $A =
B \times \mathcal{X}$, where

$$B = \left\{ s : \int_{\mathcal{X}} f(x)\,d\nu(x) > 1 \right\}.$$

Then $\lim_{n\to\infty} \int_A f_n(x)\,d\mu \times \nu(s, x) = \int_A f(x)\,d\mu \times \nu(s, x)$. The left-hand
side is just $\mu(B) = \Pr(B)$ since $f_n$ is a density. But the right-hand side




is greater than $\mu(B)$ if $\mu(B) > 0$, so the integral of $f$ is at most 1 with
probability 1. But $\int f(x)\,d\mu \times \nu(s, x) = 1$ from (1.123), so $\int_{\mathcal{X}} f(x)\,d\nu(x) = 1$
with probability 1. It follows from Scheffe's theorem B.79 that part (2) is
true. □
There is a convenient way to check the condition $\sup_n \int_{\mathcal{X}} E[f_n^2(x)]\,d\nu(x) <
\infty$ in Theorem 1.121.

Lemma 1.124. If $\sum_{n=1}^\infty \sup_{B \in \pi_n} \mathrm{Var}(V_{n;B})/(E V_{n;B})^2 < \infty$, then

$$\sup_n \int_{\mathcal{X}} E[f_n^2(x)]\,d\nu(x) < \infty.$$

Proof. Since the set in $\pi_n$ to which $x$ belongs is not random and $E(V_n(x))
= E(V_{n;B})$ for all $x \in B$, we have



• * Prim-*) 



VarVn(x) 



EVar(V n;B ) 
SUP (vv \ 2 < °°- 

Note that log(y) < y - 1 for all y > 0. With y = snp x E[V n 2 (x)]/(EV„(x)) 2 , 
we get, for each n, 

, E[V 2 (x)] E[V„ 2 (x)] n 

hence 

/ ~ E[V 2 (x)l \ A E[V„ 2 (x)] 

log (n-P(^) - E-p log (» 



~ f E[y„ 2 (x)] \ 

n=l 



< OO. 

2 



Since the Z n (B) are independent, we get that fHU EV n (a?) 2 /(EV r n (a:)) 
equals Ef*(x). Hence, sup x Et*(x) is integrable. □ 
The following simple corollaries follow from Tonelli's theorem A.69. 

Corollary 1.125. Let $X \sim P$ given $\mathbf{P} = P$. If $\mathbf{P} \ll \nu$ with probability 1, then
the predictive distribution of $X$ has density with respect to $\nu$ equal to the
mean of $d\mathbf{P}/d\nu$.

Corollary 1.126. If $\{X_n\}_{n=1}^\infty$ are conditionally IID with distribution $P$
given $\mathbf{P} = P$ and $\mathbf{P} \ll \nu$ with probability 1, then conditional on $X_1, \ldots, X_n$,
$\mathbf{P} \ll \nu$ with probability 1, a.s. with respect to the joint distribution of
$X_1, \ldots, X_n$.






At this point, we will introduce one special class of tailfree processes 
which includes Dirichlet processes as special cases but also includes cases 
that satisfy the conditions of Theorem 1.121. The class is called Polya
tree distributions. A good introduction to these processes is contained in
the papers by Mauldin, Sudderth, and Williams (1992) and Lavine (1992). 
[See also Mauldin and Williams (1990)]. 

Definition 1.127. Let $\mathbf{P}$ be tailfree with respect to $(\{\pi_n\}_{n=1}^\infty, \{V_{n;B} : n \ge
1, B \in \pi_n\})$ and suppose that

• for each $n \ge 0$ and each $B \in \pi_n$, there are exactly $k$ sets in $\pi_{n+1}$,
$B_1(B), \ldots, B_k(B)$ such that $B = ps(B_i(B))$,

• for each $n \ge 0$ and each $B \in \pi_n$, the joint distribution of $\{V_{n+1;B_i(B)} :
i = 1, \ldots, k\}$ is Dirichlet, and they are independent for different $B \in \pi_n$;

then $\mathbf{P}$ has Polya tree distribution.

Note that each partition $\pi_n$ has $k^n$ elements. It is possible to allow some
of the partition elements to be $\emptyset$ so that there are fewer than $k^n$ nonempty
elements of $\pi_n$, but then the Dirichlet distributions would have to
be partially degenerate in the sense that some coordinates would have to
be 0 with probability 1.

The posterior distribution of a Polya tree process $\mathbf{P}$ given an observation
$X$ can be determined by examining the step in the proof of Theorem 1.115
in which the posterior is given, namely (1.117). For each $n$ and each $x \in \mathcal{X}$,
let

$$W_n(x) = \left(V_{n;B_1(C_{n-1}(x))}, \ldots, V_{n;B_k(C_{n-1}(x))}\right), \qquad (1.128)$$

in the notation of Definitions 1.127 and 1.106. This is the vector of random
variables for partition $\pi_n$ corresponding to the subsets of $C_{n-1}(x)$.
According to (1.117), the posterior distribution of $W_n(x)$ given $X = x$ has
a density with respect to the prior distribution equal to $v_x/E(V_n(x))$, where
$v_x$ is the dummy variable for the coordinate corresponding to $V_n(x)$, which
is a coordinate of $W_n(x)$. If Dir$_k(\alpha_{n,1}(x), \ldots, \alpha_{n,k}(x))$ is the prior distribution
of $W_n(x)$, then the posterior is Dirichlet with the $i$th parameter equal
to

$$\alpha_{n,i}(x) + I_{B_i(C_{n-1}(x))}(x). \qquad (1.129)$$

For all of the other V random variables corresponding to partition 7r n , the 
posterior distributions are the same as the prior distributions. In summary 
the posterior distributions of the V random variables are the same as the 
priors for all Vs corresponding to sets such that x is not in the most re- 
cent superset. For those sets such that x is in the most recent superset, 
the distribution of the vector of V random variables is Dirichlet with the 
same parameters as in the prior except for the set in which x lies, whose 
parameter is one higher than in the prior. Note that this is the same thing 






that happens in the Dirichlet process. The difference is that, for a Dirichlet
process, the above argument applies to every set, and every set is in the
first partition $\pi_1$ for the Dirichlet process. Given this description of the
posterior, we can use Corollary 1.125 to construct the predictive density of
a future observation $X_{n+1}$ given the observed values of $X_1, \ldots, X_n$.

Example 1.130. Let $\mathbf{P}$ have a Polya tree distribution. Suppose that the
conditions of Theorem 1.121 hold and that $\mathbf{P} \ll \nu$ with probability 1. Let $x_1, \ldots, x_m$
be the observed values of the first $m$ random quantities. For each $x \in \mathcal{X}$ and each
$n$ such that $x$ is not in the same element of $\pi_{n-1}$ as one of the $x_i$,

$$E[V_n(x) \mid X_1 = x_1, \ldots, X_m = x_m] = E[V_n(x)] = \frac{\nu(C_n(x))}{\nu(ps(C_n(x)))}.$$

It follows that for all such $n$ and $x$,

$$E[f_n(x) \mid X_1 = x_1, \ldots, X_m = x_m] = \frac{1}{\nu(C_{r-1}(x))} \prod_{i=1}^{r-1} g_i(x), \qquad (1.131)$$

where $r$ is the first integer such that $x$ is not in the same element of $\pi_{r-1}$ with
any of the $x_i$, and $g_i(x)$ is the posterior mean of $V_i(x)$ given the observed data.
It follows from Tonelli's theorem A.69 that $E f(x)$ equals (1.131).

We can actually find an explicit formula for $g_i(x)$. Using the same notation
as above, suppose that $V_n(x)$ is coordinate $i_n(x)$ of $W_n(x)$ and that $W_n(x)$ has
Dir$_k(\alpha_{n,1}(x), \ldots, \alpha_{n,k}(x))$ prior distribution. Then, the posterior distribution of
$V_n(x)$ is Beta$(a, b)$ with

$$\begin{aligned}
a &= \alpha_{n,i_n(x)}(x) + \sum_{j=1}^{m} I_{C_n(x)}(x_j), \\
b &= \sum_{\ell \ne i_n(x)} \alpha_{n,\ell}(x) + \sum_{j=1}^{m} I_{C_{n-1}(x) \setminus C_n(x)}(x_j).
\end{aligned}$$

That is, the first parameter of the posterior beta distribution equals the prior
parameter plus the number of observations that are in the same partition set
as $x$. The second parameter of the posterior beta distribution equals the prior
parameter plus the number of observations that were in the same partition set
as $x$ in the most recent partition but now are not in the same partition set as $x$.
It follows that $g_n(x) = a/(a + b)$.
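The recipe in Example 1.130 can be sketched for one concrete setting. Everything specific below is an assumption made for illustration: $\mathcal{X} = [0, 1]$, $\pi_n$ is the dyadic partition into $2^n$ intervals ($k = 2$), $\nu$ is Lebesgue measure, and each pair at level $n$ has a symmetric Dir$_2(c_n, c_n)$ prior, so that $E(V_n(x)) = 1/2 = \nu(C_n(x))/\nu(ps(C_n(x)))$. The names `cell` and `posterior_mean_density` are hypothetical.

```python
import math

def cell(x, level):
    """Index of the dyadic interval of length 2**-level that contains x in [0, 1]."""
    return min(int(math.floor(x * 2 ** level)), 2 ** level - 1)

def posterior_mean_density(x, data, c, max_level=30):
    """Posterior mean of the density at x: product of g_n(x) / (1/2) over levels with data."""
    density = 1.0
    for n in range(1, max_level + 1):
        parent = [y for y in data if cell(y, n - 1) == cell(x, n - 1)]
        if not parent:            # no data in C_{n-1}(x): all later factors equal 1
            break
        same = sum(1 for y in parent if cell(y, n) == cell(x, n))
        a = c(n) + same                    # prior parameter + data in C_n(x)
        b = c(n) + (len(parent) - same)    # prior parameter + data in the sibling set
        density *= 2.0 * a / (a + b)       # g_n(x) divided by the prior mean 1/2
    return density

data = [0.1]
print(posterior_mean_density(0.9, data, lambda n: n * n))  # below 1: far from the data
print(posterior_mean_density(0.1, data, lambda n: n * n))  # above 1: at the observation
```

The loop stops at the first level $r$ for which no observation shares $C_{r-1}(x)$ with $x$, which is the truncation in (1.131); `max_level` only guards against a point $x$ that coincides with an observation at every level.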

For the special case of $\mathcal{X} = [0, 1]$, $k = 2$ and $\alpha_{n,1}(x) = c_1$, $\alpha_{n,2}(x) = c_2$ for
all $n$ and $x$, Dubins and Freedman (1963) show that $\mathbf{P}$, although continuous,
has no density. That is, the distribution is not absolutely continuous with
respect to Lebesgue measure. For Polya trees with $\alpha_{n,i}(x) = c_n$ for all $n$,
$i$, and $x$, $\mathrm{Var}(V_n(x)) = (k - 1)/[k^2(k c_n + 1)]$. Lemma 1.124 says that if
$\sum_{n=1}^\infty 1/c_n$ converges, then $\mathbf{P}$ is absolutely continuous with respect to $\nu$
with probability 1. If $\nu$ is absolutely continuous with respect to Lebesgue
measure, this gives us an easy way to construct Polya tree processes that
have densities with respect to Lebesgue measure.






Example 1.132. Let $\{X_n\}_{n=1}^{\infty}$ be an exchangeable sequence such that the prior
marginal distribution of each $X_i$ is $N(0, 100)$. Let $Y_i = \Phi(X_i/10)$, where $\Phi$ is
the $N(0, 1)$ CDF. The prior marginal distribution of the $Y_i$ is $U(0, 1)$. Suppose
that we model the $Y_i$ as conditionally IID given $\mathbf{P}$ and that $\mathbf{P}$ has a Polya
tree process distribution on $[0,1]$ with $k = 2$ and $\alpha_{n,i}(x) = n^2/2$ for all $n$,
$i$, and $x$. This is a special case of Example 1.114 on page 63, hence each $Y_i$
has marginal distribution $U(0, 1)$. Fifty observations $X_1, \ldots, X_{50}$ were simulated
from a Laplace distribution $Lap(1, 1)$, which does not look much like the prior
marginal distribution $N(0, 100)$. The posterior mean of the density of $\mathbf{P}(10\Phi^{-1})$
was computed and is plotted in Figure 1.133 together with the prior marginal
mean of the density and a histogram of the data values. The posterior mean of
the density of $\mathbf{P}(10\Phi^{-1})$ is high where the data values are close together, as is
to be expected. The posterior mean smooths out some of the ups and downs
in the histogram, especially those in the tails. The reason it smooths the tails
a bit more than the center of the distribution is that the partition sets in the
tail which do and do not contain observations only belong to partitions $\pi_n$ for
relatively large $n$. A few observations do not have much impact on the posterior
distribution of $V_{n;B}$ for large $n$ because the prior is $Beta(n^2/2, n^2/2)$. In the center
of the distribution, however, the different partition sets belong to the same $\pi_n$
for smaller values of $n$.
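
The change of scale used in Example 1.132 can be checked numerically. The following is a minimal sketch, not taken from the text: it uses Python's standard `statistics.NormalDist`, and the helper names `to_unit`, `to_original`, and `cdf_of_y` are hypothetical. It confirms that if $X$ has the $N(0, 100)$ distribution (standard deviation 10), then $Y = \Phi(X/10)$ has CDF $\Pr(Y \le y) = y$, that is, $Y$ is $U(0, 1)$.

```python
from statistics import NormalDist

prior_x = NormalDist(mu=0.0, sigma=10.0)  # N(0, 100): variance 100, sd 10
std_normal = NormalDist()                 # N(0, 1); its CDF plays the role of Phi


def to_unit(x):
    """Map x to the unit interval: y = Phi(x / 10)."""
    return std_normal.cdf(x / 10.0)


def to_original(y):
    """Inverse map: x = 10 * Phi^{-1}(y)."""
    return 10.0 * std_normal.inv_cdf(y)


def cdf_of_y(y):
    """Pr(Y <= y) = Pr(X <= 10 * Phi^{-1}(y)) when X ~ N(0, 100)."""
    return prior_x.cdf(to_original(y))


for y in (0.05, 0.25, 0.5, 0.9):
    # Each printed pair should agree up to rounding, as for a U(0, 1) CDF.
    print(y, round(cdf_of_y(y), 6))
```

A density on the $Y$ scale is carried back to the $X$ scale through `to_original`, which is the role played by $10\Phi^{-1}$ in the example.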

[Figure 1.133. Posterior Mean of Polya Tree Density. Legend: Histogram, Prior Mean, Post. Mean, True Dens.]

It is interesting to note that there is a similarity between the posterior
distributions from Polya tree processes and Dirichlet processes. Let $\mathbf{P}_1$
have $Dir(\alpha)$ distribution, and let $\mathbf{P}_2$ have a Polya tree process distribution
with $\alpha_{1,i} = \alpha(B_{1,i})$ for $i = 1, \ldots, k$. Then, it is easy to check that for every
data set, the posterior distributions of $\mathbf{P}_j(B_{1,1}), \ldots, \mathbf{P}_j(B_{1,k})$ are the same for
$j = 1, 2$. That is, for sets in the first partition $\pi_1$, the Polya tree process
looks just like a Dirichlet process. For Dirichlet processes, two disjoint sets
have posterior distributions that depend in no way on where the two sets
are located relative to each other. The same is true for elements of the first
partition in a Polya tree process. For Polya tree processes, two sets in the
$n$th partition will have their probabilities more closely related when they
share more superset partition sets. For example, two subsets of $B_{1,1}$ will
be more closely related than a subset of $B_{1,1}$ and a subset of $B_{1,2}$. Two
subsets of $B_{2,1} \subseteq B_{1,1}$ will be more closely related than a subset of $B_{2,1}$
and a subset of $B_{2,2} \subseteq B_{1,1}$, even though both are subsets of $B_{1,1}$.

One potential problem with tailfree process priors is the dependence 
on the sequence of partitions. One consequence of this dependence is easily 
seen in Figure 1.133. The tall vertical lines in the posterior mean plot occur 
at boundaries of sets in early partitions. The following example explores 
this in more detail. 

Example 1.134. Suppose that $\mathcal{X} = [0, 1]$ and we use a Polya tree prior with
$k = 2$ and $\alpha_{n,i}(x) = n^2/2$ for all $n$, $i$, and $x$. Suppose that $X_1 = 0.49$. The
predictive density of $X_2$ at the value $x = 0.51$ is calculated as in Example 1.130
on page 70. It is

$$f_{X_2|X_1}(0.51|0.49) = 2 \cdot \frac{1/2}{1 + 1} = 0.5.$$

On the other hand, the predictive density of $X_2$ at $x = 0.47$ is

$$f_{X_2|X_1}(0.47|0.49) = \left(\prod_{n=1}^{5} \frac{n^2 + 2}{n^2 + 1}\right) \frac{36}{37} = 2.1183.$$

Note that in each of these cases, the proposed value of $X_2$ differs from the observed
$X_1$ by 0.02 and they are all in the vicinity of 0.5, and yet the first predictive
density is so much smaller than the second. The reason is the following. In the
first case, the two data values share no partition sets in common, not even in $\pi_1$.
In the second case, the two data values share the same partition set for the first
five partitions. In symbols, $C_1(0.49) \neq C_1(0.51)$ while $C_n(0.47) = C_n(0.49)$ for
$n = 1, \ldots, 5$. Sharing partition sets is what makes predictive densities large.
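
The two predictive densities in Example 1.134 can be reproduced with a short computation. The sketch below is an illustration under stated assumptions, not the book's own algorithm: it assumes that $\pi_n$ consists of the $k^n$ equal-length intervals of $[0, 1]$, that $\alpha_{n,i}(x) = n^2/k$, and that each level contributes the factor $k(\alpha_n + 1)/(k\alpha_n + 1)$ while the new point shares a partition set with the observation, $k\alpha_n/(k\alpha_n + 1)$ at the first level where they separate, and 1 afterwards. The function names are hypothetical.

```python
from fractions import Fraction


def digits(x, k, depth):
    """First `depth` base-k digits of x in [0, 1): digit n indexes the
    subinterval of the level-(n-1) set that contains x."""
    out = []
    for _ in range(depth):
        x *= k
        d = int(x)  # floor, since x is nonnegative
        out.append(d)
        x -= d
    return out


def predictive_density(x_new, x_obs, k, depth=30):
    """Predictive density of X_2 at x_new given X_1 = x_obs for a Polya tree
    with k equal-length children per set and alpha_{n,i} = n^2 / k (assumed)."""
    a, b = digits(x_new, k, depth), digits(x_obs, k, depth)
    dens = Fraction(1)
    for n in range(1, depth + 1):
        alpha = Fraction(n * n, k)
        shared = a[:n] == b[:n]
        # k times the posterior mean of the branch probability at level n
        dens *= k * (alpha + (1 if shared else 0)) / (k * alpha + 1)
        if not shared:
            break  # no data below this set: every later factor equals 1
    return dens


x1 = Fraction("0.49")
print(f"{float(predictive_density(Fraction('0.51'), x1, 2)):.4f}")  # 0.5000
print(f"{float(predictive_density(Fraction('0.47'), x1, 2)):.4f}")  # 2.1183
```

Exact rational arithmetic (`Fraction`) is used so that the partition set containing each point is identified without floating-point error near interval boundaries.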

One way to reduce the effect of the problem illustrated in Example 1.134 
is to use a mixture of tailfree priors with partitions that have no common 
boundaries. 

Example 1.135 (Continuation of Example 1.134). Suppose that we use a half-and-half
mixture of two Polya tree priors with $k = 2, 3$ and $\alpha_{n,i}(x) = n^2/k$. After
some tedious algebra, one calculates the two predictive densities as

$$f_{X_2|X_1}(0.51|0.49) = 1.8312,$$
$$f_{X_2|X_1}(0.47|0.49) = 2.3191.$$

The reason that the first density is now almost as high as the second is that 0.49 
and 0.51 appear together in one more partition set in the k = 3 prior than do 
0.49 and 0.47. The densities are higher than with k = 2 alone because a prior 
with larger k tends to let the density track the data more. With values so close 
together as the ones in this example, the prior with k = 3 has very high posterior 
mean of f (x) for x near 0.49 and very low mean for x not near 0.49. (Using the 
k = 3 prior alone, the two predictive density values would have been 3.1624 and 
2.5200, respectively.) 
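
The mixture values in Example 1.135 can be recovered from the single-prior values quoted in the text. The worked equation below is a sketch under an assumption not stated in this passage: each component prior gives $X_1$ a $U(0, 1)$ marginal density, so observing $X_1 = 0.49$ leaves the mixture weights at $1/2$ each. The notation $f^{(k)}$ for the predictive density under the prior with $k$ sets per split is introduced here for illustration only.

```latex
\begin{align*}
f_{X_2|X_1}(x \mid 0.49)
  &= \tfrac{1}{2}\, f^{(2)}_{X_2|X_1}(x \mid 0.49)
   + \tfrac{1}{2}\, f^{(3)}_{X_2|X_1}(x \mid 0.49), \\
f_{X_2|X_1}(0.51 \mid 0.49) &= \tfrac{1}{2}\,(0.5 + 3.1624) = 1.8312, \\
f_{X_2|X_1}(0.47 \mid 0.49) &= \tfrac{1}{2}\,(2.1183 + 2.5200) \approx 2.3191.
\end{align*}
```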






1.7 Problems 



Throughout this text problems are given and the following type of expression
is often used: "Suppose that (some random quantities) are conditionally
independent given $\Theta = \theta$." This will mean that, for all $\theta$ in some
parameter space (implicit or explicit), the random quantities are conditionally
independent given $\Theta = \theta$ with some distributions to be specified
later in the problem. Some of the more challenging problems throughout
the text have been identified with an asterisk (*) after the problem number.

Section 1.2: 



1. Let $X_1, X_2, X_3$ be random variables whose joint distribution is given by

$$\Pr(X_1 = 1, X_2 = 1, X_3 = 0) = \Pr(X_1 = 1, X_2 = 0, X_3 = 1) = \Pr(X_1 = 0, X_2 = 1, X_3 = 1) = \tfrac{1}{3}.$$

(a) Prove that $X_1, X_2, X_3$ are exchangeable.

(b) Prove that if $X_4 \in \{0, 1\}$ is another random variable, then it cannot
be that $X_1, X_2, X_3, X_4$ are exchangeable.

2. For each positive integer $n$, let $F_n$ be the joint CDF of $n$ random variables.
Suppose that the following two conditions hold:

• The sequence of $n$-dimensional CDFs $\{F_n\}_{n=1}^{\infty}$ is consistent. (See
Definition B.132 on page 652.)

• For each $n$, each $n$-tuple $(x_1, \ldots, x_n)$, and each permutation
$(y_1, \ldots, y_n)$ of $(x_1, \ldots, x_n)$, $F_n(x_1, \ldots, x_n) = F_n(y_1, \ldots, y_n)$.

Prove that there is a sequence of random variables $\{X_n\}_{n=1}^{\infty}$ that are
exchangeable and such that $F_n$ is the joint CDF of $X_1, \ldots, X_n$ for every
$n$.

3. Suppose that $\{X_n\}_{n=1}^{\infty}$ are exchangeable. Let $Y_i = X_{n+i}$ for $i = 1, 2, \ldots$.
Show that $\{Y_i\}_{i=1}^{\infty}$ are exchangeable conditional on $X_1, \ldots, X_n$.

4. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID given $Y$. Prove that they are
exchangeable.

5. Suppose that $\{X_n\}_{n=1}^{\infty}$ are IID $N(\mu, 1)$ conditional on $M = \mu$, and
$M \sim N(0, 1)$. Find the joint distribution of every subset of size $k$ of the $X_i$ and
show that the $X_i$ are exchangeable. Also, find the conditional distribution
of $\{X_{n+k}\}_{k=1}^{\infty}$ given $X_1 = x_1, \ldots, X_n = x_n$.

6. In Example 1.15 on page 9, prove that the limit (as $n \to \infty$) of the
empirical probability measures of $X_1, \ldots, X_n$ is the Dirichlet distribution
$Dir_k(u_1, \ldots, u_k)$.

7. Let $\{Y_n\}_{n=1}^{\infty}$ be IID random variables with CDF $F$. Let $Z$ be independent
of $\{Y_n\}_{n=1}^{\infty}$ with CDF $G$, and let $X_n = Y_n + Z$ for every $n$.

(a) Prove that $\{X_n\}_{n=1}^{\infty}$ are exchangeable.

(b) Write the joint CDF of $(X_1, \ldots, X_n)$ in terms of $F$ and $G$.






Section 1.3: 



8. Let $X_1, \ldots, X_m$ be numerical characteristics of $m$ individuals in a finite
population. Suppose that we are interested in $S = \sum_{i=1}^{m} X_i$, the population
total. We model the $X_i$ as exchangeable as follows. Let $\Theta$ be a parameter
such that conditional on $\Theta = \theta$, the $X_i$ are IID with $Exp(\theta)$ distribution
and $\Theta$ has $\Gamma(a, b)$ distribution. Suppose that we observe $X_1 = x_1, \ldots, X_n = x_n$
for some $n < m$. Find the predictive distribution of $S$ given this data.
Also, find the mean of this predictive distribution.

9. Let $\Theta$ be a parameter with parameter space $(\Omega, \tau)$, and let
$f_{X|\Theta}(x|\theta) = dP_\theta/d\nu$ for every $\theta$. Let $\mu_\Theta$ be a prior on $(\Omega, \tau)$.
Let $Q$ be the joint distribution of $(X, \Theta)$. Show that
$f_{X|\Theta} = dQ/d(\nu \times \mu_\Theta)$, a.s. $[\nu \times \mu_\Theta]$. This says
that, for every prior distribution, we can find a version of $f_{X|\Theta}$ which is
jointly measurable in $(x, \theta)$.

10. Prove that the formula on the right-hand side of (1.37) on page 18 is the
same as

$$\frac{f_{X_1,\ldots,X_{n+k}}(x_1, \ldots, x_{n+k})}{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}.$$

11. Suppose that for every $m = 1, 2, \ldots$, $f_{X_1,\ldots,X_m}(x_1, \ldots, x_m)$ equals

$$\begin{cases} \sum_{i=0}^{10} a_i \left(\frac{i}{10}\right)^{x} \left(1 - \frac{i}{10}\right)^{m-x} & \text{if all } x_i \in \{0, 1\}, \\ 0 & \text{otherwise,} \end{cases}$$

where $x = \sum_{i=1}^{m} x_i$ and the numbers $a_i$ are nonnegative and add to 1.
Let $\Theta = \lim_{n\to\infty} \sum_{i=1}^{n} X_i/n$. Prove that the prior distribution of $\Theta$ is
$\Pr(\Theta = i/10) = a_i$ for $i = 0, \ldots, 10$.

12. Suppose that for every $m = 1, 2, \ldots$,

$$f_{X_1,\ldots,X_m}(x_1, \ldots, x_m) = \frac{2}{(m+1)\, c_m(x_1, \ldots, x_m)^{m+1}}, \quad \text{if all } x_i \geq 0,$$

where $c_m(x_1, \ldots, x_m) = \max\{2, x_1, \ldots, x_m\}$.

(a) Prove that the $X_i$ are exchangeable and that these distributions are
consistent.

(b) Let $Y_n = c_n(X_1, \ldots, X_n)$. Find the distribution of $Y_n$ and the limit
of this distribution as $n \to \infty$.

(c) Find the conditional density of $X_{n+1}$ given $X_1 = x_1, \ldots, X_n = x_n$,
and assume that $\lim_{n\to\infty} c_n(x_1, \ldots, x_n) = \theta$. Find the limit of the
conditional density as $n \to \infty$.

(d) Use DeFinetti's representation theorem to show that the prior (the
answer to part (b)) and likelihood (the answer to part (c)) combine
to give the original joint distribution.

13. Let $\Theta$ be a random variable. Suppose that $\{X_n\}_{n=1}^{\infty}$ are IID $N(\theta, 1)$. Let
$T(0) = 0$. For each $i > 0$, let $T(i)$ be the first $j > T(i - 1)$ such that
$X_j > 0$. Let $Y_i = X_{T(i)}$ for $i = 1, 2, \ldots$. If we use Lebesgue measure as an
improper prior distribution for $\Theta$, how many observations must we observe
before the posterior becomes proper?






14. An observation $X$ is to be made in hopes of learning something about a
parameter $\Theta$. The prior distribution of $\Theta$ has some density $f_\Theta$, but we
are not certain what is the appropriate distribution to use for $X$ given
$\Theta$. Suppose that we have $k$ choices with densities $f_1(\cdot|\theta), \ldots, f_k(\cdot|\theta)$. Let
$\pi_1, \ldots, \pi_k$ be nonnegative numbers adding to 1, and set

$$f_{X|\Theta}(x|\theta) = \sum_{i=1}^{k} \pi_i f_i(x|\theta).$$

Show that there are numbers $\pi_1^*(x), \ldots, \pi_k^*(x)$ adding to 1 such that

$$f_{\Theta|X}(\theta|x) = \sum_{i=1}^{k} \pi_i^*(x)\, p_i(\theta|x),$$

where $p_i(\cdot|x)$ is the posterior density of $\Theta$ that we would have calculated
if we had used $f_i(\cdot|\theta)$ as the conditional density for $X$ given $\Theta = \theta$.

15. Let $\{X_n\}_{n=1}^{\infty}$ be IID $Ber(\theta)$ random variables given $\Theta = \theta$. Define a prior
$\mu_\Theta$ for $\Theta$ by $\mu_\Theta(B) = [\delta(B) + \eta(B)]/2$ where $\eta$ is Lebesgue measure and
$\delta(B) = I_B(1/2)$.

(a) Find the marginal distribution for $X_{i_1}, \ldots, X_{i_n}$ for distinct integers
$i_1, \ldots, i_n$.

(b) Find the posterior distribution for $\Theta$ given $X_1 = x_1, \ldots, X_n = x_n$,
that is, find $\Pr(\Theta \leq \theta | X_1 = x_1, \ldots, X_n = x_n)$.

16. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally independent random variables
with $X_i \sim N(\mu_i, 1)$ given $(M, M_1, \ldots, M_n, \ldots) = (\mu, \mu_1, \ldots, \mu_n, \ldots)$.
Suppose also that given $M = \mu$,

[display equation not recoverable from the source text]

for each $n$, and $M \sim N(\mu_0, 1)$.

(a) Prove that $\{X_n\}_{n=1}^{\infty}$ are exchangeable.

(b) Find a one-dimensional random variable $\Theta$ such that the $X_i$ are
conditionally IID given $\Theta = \theta$, and find the distribution of $\Theta$.

(c) Show that $\bar{X}_n = \sum_{i=1}^{n} X_i/n$ does not converge in probability to $M$.

17. Let $c$ be a constant, and let $X$, $Y$ be conditionally independent given $\Theta = \theta$
with $X \sim Poi(\theta)$, $Y \sim Poi(c\theta)$. Let $\Theta \sim \Gamma(\alpha_0, \beta_0)$.

(a) Find the posterior distribution of $\Theta$ given $X = x$.

(b) Find the posterior predictive distribution of $Y$ given $X = x$, and show
that it is a member of the negative binomial family.






18. Suppose that an expert believes that $\{X_n\}_{n=1}^{\infty}$ are exchangeable Bernoulli
random variables. Let $\Theta = \lim_{n\to\infty} \sum_{i=1}^{n} X_i/n$, and assume that a
statistician wishes to model $\Theta \sim Beta(a, b)$. The statistician tries to elicit the
values of $a$ and $b$ from the expert by asking questions like, "What is the
probability that $X_1 = 1$?" and "How many $X_i = 1$ in a row would you have
to observe before you would raise the probability that the next $X_j = 1$ up
to $q$?" Suppose that the answer to the first question is $p$, and suppose that
$q$ in the second question is chosen to be $(1 + p)/2$. Let the answer to the
second question be $m$.

(a) Find values for $a$ and $b$ which are consistent with the model.

(b) Find the partial derivatives of $a$ and $b$ with respect to $p$, and find the
effects of a change of $\pm 1$ in $m$ on both $a$ and $b$.

(c) Suppose that the second question above was changed to "If you were
to observe $X_1 = \cdots = X_{10} = 1$, what would you give as the
$\Pr(X_{11} = 1)$?" Let the answer to this question be $r$. Find values of $a$ and $b$
consistent with the values of $p$ and $r$, and find the partial derivatives
of $a$ and $b$ with respect to $p$ and $r$.

19. Suppose that an expert believes that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID with
$N(\mu, \sigma^2)$ distribution given $(M, \Sigma) = (\mu, \sigma)$. A statistician wishes to model
$M \sim N(\mu_0, \sigma^2/\lambda_0)$ given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$. The statistician
tries to elicit the values of $\mu_0, \lambda_0, a_0$, and $b_0$ from the expert by asking a
sequence of questions such as:

• What is the median of the distribution of $X_1$? (Suppose that the
answer is $u_1$.)

• Given that $X_1 > u_1$, what is the conditional median of the distribution
of $X_1$? (Suppose that the answer is $u_2$.)

• Given that $X_1 > u_2$, what is the conditional median of the distribution
of $X_1$? (Suppose that the answer is $u_3$.)

• If $X_1 = u_2$ is observed, what would be the conditional median of the
distribution of $X_2$? (Suppose that the answer is $u_4$.)

(a) Prove that the following constraints on $u_1, u_2, u_3$, and $u_4$ are sufficient
for there to exist a prior distribution of the desired form consistent
with the responses: $u_1 < u_4 < u_2$ and $(u_3 - u_1)/(u_2 - u_1) > 1.705511$.
(The constraints are actually necessary as well.)

(b) Suppose that the following answers are given: $u_1 = 14.56$, $u_2 = 21.34$,
$u_3 = 29.47$, and $u_4 = 19.25$. Find values of $\mu_0, \lambda_0, a_0$, and $b_0$ which
are consistent with these answers.

Section 1.4:

20. For the joint density in Example 1.53 on page 29, prove that the distributions
are consistent (as $n$ changes).

21. Consider the joint density in Example 1.53 on page 29, and define
$Y_n = n/\sum_{i=1}^{n} X_i$. Find the distribution of $Y_n$. Also, let $n \to \infty$ and prove that
the limit of the distribution of $Y_n$ is $\Gamma(a, b)$.






22. For the joint density in Example 1.53, prove that the conditional distribution
of the $X_i$ given $\Theta = \theta$, namely $Exp(\theta)$, and the marginal distribution
of $\Theta$, found in Problem 21, namely $\Gamma(a, b)$, do indeed induce the joint
distribution for $X_1, \ldots, X_n$ for all $n$. How do we know that no other combination
of distributions will induce the joint distribution of the $X_i$?

23. Let $X_1, \ldots, X_{14}$ be exchangeable Bernoulli random variables, and let
$M = \sum_{i=1}^{14} X_i$. Let the distribution of $M$ be given by the mass function (density
with respect to counting measure)

$$f_M(m) = \begin{cases} \ldots \\ 0.2 & \text{if } m = 8, \\ 0.5 & \text{if } m = 13. \end{cases}$$

(a) Find the probability that in four specific trials, we observe three successes
and one failure (without regard to which of the trials is a failure
and which are successes).

(b) Suppose that we observe three successes in the first four trials. Find
all the probabilities of $k$ successes in $n$ future trials for $n = 1, \ldots, 10$
and $k = 0, \ldots, n$. (Give a formula.)

24. Suppose that $X_1, \ldots, X_n$ are exchangeable and take values in the Borel
space $(\mathcal{X}, \mathcal{B})$. Prove that the empirical probability measure $P_n$ is a
measurable function from the $n$-fold product space $(\mathcal{X}^n, \mathcal{B}^n)$ to
$(\mathcal{P}, \mathcal{C}_{\mathcal{P}})$.

25.* Refer to Problem 29 on page 664.

(a) Find the distribution of $\Theta = \lim_{n\to\infty} \sum_{i=1}^{n} X_i/n$.

(b) Assume that we observe $X_1 = 1$ and $X_2 = 0$. Find the conditional
distribution of $X_3, \ldots, X_n$ given this data for all $n = 3, 4, \ldots$.

(c) Using the same data, find the conditional distribution of $\Theta$.

26. Suppose that $\{X_n\}_{n=1}^{\infty}$ is an infinite sequence of exchangeable random
variables with finite variance.

(a) Prove that the covariance of $X_i$ with $X_j$ is nonnegative for $i \neq j$.

(b) Give an example of such a sequence in which $\mathrm{Cov}(X_i, X_j) = 0$ but
the random variables are not mutually independent.

27.* Let $X_1, X_2, X_3$ be IID $U(0, 1)$ random variables. After observing
$X_1 = x_1, X_2 = x_2, X_3 = x_3$, define $Y_1, Y_2$ to be the results of drawing two
numbers at random without replacement from the set $\{x_1, x_2, x_3\}$. Prove that
$Y_1$ and $Y_2$ are IID $U(0, 1)$.

28. Let $X_1, \ldots, X_n$ be IID with some distribution $P$ on a Borel space $(\mathcal{X}, \mathcal{B})$.
Let the conditional distribution of $Y_1, \ldots, Y_k$ (for $k < n$) given
$X_1 = x_1, \ldots, X_n = x_n$ be that of $k$ draws without replacement from the set
$\{x_1, \ldots, x_n\}$. Prove that the joint distribution of $Y_1, \ldots, Y_k$ is that of IID
random quantities with distribution $P$.

29. Let $(\mathcal{X}, \mathcal{B})$ be a Borel space, and let $X_i$ take values in $\mathcal{X}$ for $i = 1, \ldots, n$.
Suppose that $X_1, \ldots, X_n$ are exchangeable. Let the conditional distribution
of $Y_1, \ldots, Y_k$ (for $k < n$) given $X_1 = x_1, \ldots, X_n = x_n$ be that of $k$
draws without replacement from the set $\{x_1, \ldots, x_n\}$. Prove that the joint
distribution of $Y_1, \ldots, Y_k$ is the same as the joint distribution of $X_1, \ldots, X_k$.







30. In the setup of Problem 3 on page 73, let $\mathbf{P}$ be the limit of the empirical
probability measures of $X_1, \ldots, X_n$ as $n \to \infty$. Show that $\mathbf{P}$ is also the
limit of the empirical probability measures of the $Y_i$.

31. State and prove a central limit theorem for exchangeable random variables. 
You may use Theorems B.97 and 1.49. 

Section 1.5: 

32. Prove Corollary 1.63 on page 36 using Theorem 1.62. (Hint: Prove that
$\lim Y_n$ is measurable with respect to the tail $\sigma$-field of $\{X_n\}_{n=1}^{\infty}$. Then
apply the Kolmogorov zero-one law B.68.)

33. Refer to Example 1.76 on page 42. Let $M^* = M - X_1$. Take the conditional
distribution of $M^*$ given $X_1 = x_1$ as a prior distribution for $M^*$ after
learning that $X_1 = x_1$.

(a) Find the probability that $X_2 = 0$ (conditional on $X_1 = 1$) using this
new prior distribution.

(b) Find the posterior distribution for $M^*$ given $X_2 = 0$ (and $X_1 = 1$).

34. Let $\{X_n\}_{n=1}^{\infty}$ be exchangeable Bernoulli random variables, and let

$$Y = \min\left\{n : \sum_{i=1}^{n} X_i = 2\right\},$$

that is, $Y$ is the time until the second success (e.g., if $X_1 = 1, X_2 = 0, X_3 = 1$,
then $Y = 3$).

(a) Find the distribution of $Y$ using the form of DeFinetti's representation
theorem in Example 1.82 on page 46.

(b) Find the conditional distribution of $\{X_{n+k}\}_{k=1}^{\infty}$ given $Y = n$.

(c) Show that the distribution in part (b) is the same as the conditional
distribution of $\{X_{n+k}\}_{k=1}^{\infty}$ given $\sum_{i=1}^{n} X_i = 2$.

35. Suppose that $\{X_n\}_{n=1}^{\infty}$ are bounded, exchangeable random variables. Let
$\Theta = \lim_{n\to\infty} \sum_{i=1}^{n} X_i/n$, a.s. Prove that $\mathrm{Var}(\Theta) = \mathrm{Cov}(X_1, X_2)$.

36. Prove that the collection $\mathcal{C}_n$ in the proof of Theorem 1.62 is a $\sigma$-field. Also
prove that $f : \mathbb{R}^{\infty} \to \mathbb{R}$ is measurable with respect to $\mathcal{C}_n$ if and only if
$f(y) = f(x)$ for all $y$ that agree with $x$ after coordinate $n$ and such that
the first $n$ coordinates of $y$ are a permutation of the first $n$ coordinates of
$x$.

37. Let $\{X_n\}_{n=1}^{\infty}$ be IID nonnegative random variables with $E(X_1) = \infty$. Show
that $\sum_{i=1}^{n} X_i/n = Y_n$ diverges to $\infty$ almost surely.

38.* Under the conditions of Theorem 1.59 it is possible to prove that $Y_n$ converges
almost surely, rather than just the subsequence $\{Y_{n_k}\}_{k=1}^{\infty}$.

(a) Let $v = \sum_{i=1}^{\infty} i^{-3/2}$ and $c_{i,k} = 1/(k v i^{3/2})$ for all $i$ and $k$. Define
$V_{i,k} = \{s : |Y_{(i+1)^4}(s) - Y_{i^4}(s)| < c_{i,k}\}$. Use the second to last equation
in (1.60) to prove that $\sum_{i=1}^{\infty} \Pr(V_{i,k}^C) < \infty$.







(b) Let $A_{k,n} = \cap_{i=n}^{\infty} V_{i,k}$. Show that for each $\epsilon > 0$ and $k$, there exists $n_k$
such that $\Pr(A_{k,n_k}^C) < \epsilon/2^k$.

(c) For each $i, j$ with $i^4 < j < (i + 1)^4$, define
$G_{i,j,k} = \{s : |Y_j(s) - Y_{i^4}(s)| > 1/k\}$. Use the second to last equation in (1.60) to prove
that $\Pr(G_{i,j,k})$ is at most a fixed multiple of $k^2/i^5$.

(d) Define $H_{i,k} = \cup_{j=i^4+1}^{(i+1)^4-1} G_{i,j,k}$. Prove that $\Pr(H_{i,k})$ is at most a fixed
multiple of $k^2/i^2$.

(e) Define $J_{k,n} = \cup_{i=n}^{\infty} H_{i,k}$. Prove that for each $k$ and $\epsilon > 0$, there exists
$m_k$ such that $\Pr(J_{k,m_k}) < \epsilon/2^k$.

(f) Prove that for every pair of sequences $\{n_k\}_{k=1}^{\infty}$ and $\{m_k\}_{k=1}^{\infty}$,

$$\left(\bigcap_{k=1}^{\infty} A_{k,n_k}\right) \cap \left(\bigcap_{k=1}^{\infty} J_{k,m_k}^C\right) \subseteq \{s : \{Y_n(s)\}_{n=1}^{\infty} \text{ is a Cauchy sequence}\}.$$

(g) Prove that for every $0 < c < 1$, the probability that $\{Y_n\}_{n=1}^{\infty}$ is a
Cauchy sequence is at least $c$, hence it must be 1.

39. State and prove a weak law of large numbers for exchangeable random 
variables. You may use Theorems B.95 and 1.49. (Don't use Theorem 1.62.) 

40.* Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of exchangeable random variables
with finite mean. Let $\{n_k\}_{k=1}^{\infty}$ be a subsequence of $\{1, 2, \ldots\}$. Prove that

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \lim_{k\to\infty} \frac{1}{k} \sum_{i=1}^{k} X_{n_i}, \quad \text{a.s.}$$

(Hint: Use DeFinetti's representation theorem and the strong law of large
numbers 1.62 to express the two limits as the same function of $\mathbf{P}$.)

41.* In this problem, you will prove the following generalization of a theorem of
Aldous (1981): An infinite sequence of random quantities $\{X_n\}_{n=1}^{\infty}$ taking
values in a Borel space $\mathcal{X}$ is exchangeable if and only if there exists a
measurable function $f : [0, 1]^2 \to \mathcal{X}$ such that $X = (X_1, X_2, \ldots)$ has the
same distribution as $(f(Z_0, Z_1), f(Z_0, Z_2), \ldots)$, where $Z_0, Z_1, \ldots$ are IID
$U(0, 1)$.

(a) Let $\mathbf{P}$ be the limit of the empirical distributions of $\{X_n\}_{n=1}^{\infty}$, and
let $\mathbf{P}^*$ be the limit of the empirical distributions of $\{X_{2n}\}_{n=1}^{\infty}$. Use
Problem 40 on this page and Problem 31 on page 664 to show that
$\mathbf{P}^* = \mathbf{P}$, a.s.

(b) Note that $\mathbf{P}$ is a measurable function on $\mathcal{X}^{\infty}$. Use Proposition B.145
and Lemma B.41 to show that there exists a measurable function
$g : [0, 1] \to \mathcal{P}$ such that $g(Z_0)$ has the same distribution as $\mathbf{P}$.

(c) Map $\mathcal{X}$ to $\mathbb{R}$ and for each $z$ map $g(z)$ to a CDF $F_z$. Then define

$$F_z^{-1}(q) = \begin{cases} \inf\{x : g(z)(x) \geq q\} & \text{if } q > 0, \\ \sup\{x : g(z)(x) = 0\} & \text{if } q = 0. \end{cases}$$

Find a function $\psi : \mathbb{R} \to \mathcal{X}$ such that $\{\psi(F_{Z_0}^{-1}(Z_i))\}_{i=1}^{\infty}$ has the same
joint distribution as {X2i+ 






(d) Show that $f(z, w) = \psi(F_z^{-1}(w))$ is the desired function.

Section 1.6.1:

42. Prove Proposition 1.98 on page 56. 

43. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally independent with distribution $P$
given $\mathbf{P} = P$, and $\mathbf{P}$ has $Dir(\alpha)$ distribution where $\alpha$ is a finite measure
on $(\mathcal{X}, \mathcal{B})$. Let $K_n$ be the number of distinct values amongst $X_1, \ldots, X_n$.
Prove that $\lim_{n\to\infty} E(K_n)/\log(n) = \alpha(\mathcal{X})$.

44. Consider the situation described in Example 1.103 on page 59. Consider
two different choices for $\alpha_\theta$. The first is $\alpha_\theta$ equal to the $U(0, \theta)$ density.
The second is Lebesgue measure on $[0, \theta]$.

(a) If no repeats occur in the data, show that the likelihood function for
$\Theta$ is the same for both choices of $\alpha_\theta$.

(b) Explain the differences between the two likelihood functions in the
case in which repeats do occur in the data.

45. Suppose that $\mathbf{P}$ has $Dir(\alpha)$ distribution and that $X_1$ and $X_2$ are random
variables that are conditionally IID with distribution $P$ given $\mathbf{P} = P$.
Suppose that $\alpha$ is absolutely continuous with respect to Lebesgue measure
with Radon-Nikodym derivative $a(\cdot)$. Find the joint density of $(X_1, X_2)$
with respect to the measure, which is two-dimensional Lebesgue measure
on $A = \{(x_1, x_2) : x_1 \neq x_2\}$ plus one-dimensional Lebesgue measure on
$B = \{(x_1, x_2) : x_1 = x_2\}$. (For definiteness, calculate one-dimensional
Lebesgue measure of a subset $C \subseteq B$ by calculating the Lebesgue measure
of the set $C_1 = \{x : (x, x) \in C\}$. The most natural alternative would be to
multiply this measure by $\sqrt{2}$ so that it equaled length for line segments.)

46. Suppose that $\mathbf{P}$ has $Dir(\alpha)$ distribution, where $\alpha$ is a finite measure on
$(\mathbb{R}, \mathcal{B})$. Let $Z$ be the median of $\mathbf{P}$, namely $Z = \inf\{x : \mathbf{P}((-\infty, x]) \geq 1/2\}$.
Show that the median of $Z$ is the median of the distribution $\alpha/\alpha(\mathbb{R})$. (Hint:
Use the result of Problem 30 on page 664.)

47. Let $\mathbf{P} \sim Dir(\alpha)$, where $\alpha$ is continuous (no point masses). Prove that the
posterior distribution of $\mathbf{P}$ is not absolutely continuous with respect to
the prior. In fact they are mutually singular (that is, the posterior assigns
probability 1 to a set to which the prior assigns probability 0).

Section 1.6.2: 

48. Prove Proposition 1.111 on page 62. 

49. Let $\mathbf{P}$ be a random probability measure on $(\mathcal{X}, \mathcal{B})$. Let $X_i : S \to \mathcal{X}$ be
random quantities that are conditionally IID given $\mathbf{P}$ with distribution $\mathbf{P}$.
Let $\mu_X$ be the marginal distribution of $X_i$. We will say that a probability
measure $P$ is continuous if $P(X_1 = X_2) = 0$. Prove that $\mathbf{P}$ is continuous
with probability 1 if and only if $\Pr(X_2 = x | X_1 = x) = 0$, a.s. $[\mu_X]$.






50. Let $(\mathcal{X}, \mathcal{B}, \nu)$ be a probability space. Assume that $\mathbf{P}$ is tailfree with respect
to $(\{\pi_n\}_{n=1}^{\infty}, \{V_{n;B} : n \geq 1, B \in \pi_n\})$. Assume that each element of each
$\pi_n$ has positive $\nu$ measure. Show that $E(V_{n;B}) = \nu(B)/\nu(ps(B))$ for all
$n$ and $B$ if and only if $\mu_X(B) = \nu(B)$ for all $B$, where $\mu_X$ is defined in
(1.112).

51.* Prove that (1.107) and (1.108) are necessary and sufficient in order that a
tailfree process be a probability measure with probability 1. (Hint: That
the two conditions are necessary is straightforward once one realizes that
countable additivity of a measure $\mu$ is equivalent to $\lim_{n\to\infty} \mu(D_n) = 0$
when $D_1 \supseteq D_2 \supseteq \cdots$ and $\cap_{n=1}^{\infty} D_n = \emptyset$. Next prove that (1.107) is sufficient
for the process to be countably additive on the smallest $\sigma$-field containing
all sets in $\pi_n$ for each $n$. Then show that the union of all these $\sigma$-fields is a
field and that (1.108) is sufficient for the process to be countably additive
on this field.)

52. Let $(X_1, \ldots, X_{n+1}) \sim Dir_{n+1}(a_1, \ldots, a_{n+1})$. Define $Y_i = \sum_{j=1}^{i} X_j$ for
$i = 1, \ldots, n + 1$, and set $Z_i = Y_i/Y_{i+1}$ for $i = 1, \ldots, n$. Prove that the $Z_i$ are
mutually independent with $Z_i \sim Beta(a_1 + \cdots + a_i, a_{i+1})$.

53. Prove Corollary 1.125 on page 68.

54. Prove Corollary 1.126 on page 68.

55.* Assume that the conditions of Theorem 1.121 hold. Prove that the posterior
distribution of $\mathbf{P}$ as found in Theorem 1.115 is the same thing that Bayes'
theorem 1.31 gives, where we let the $\Theta$ in Bayes' theorem 1.31 be $\mathbf{P}$.

56. Let $X \sim P$ given $\mathbf{P} = P$. Suppose that $E|g(X)| < \infty$, where $g : \mathcal{X} \to \mathbb{R}$. Prove
that $E(|g(X)| \,|\, \mathbf{P}) < \infty$, a.s.

57.* Let $\mathbf{P}$ have a Polya tree process distribution with each set partitioned into
$k$ sets at the next level. For each $x$ and $n$, let $i_n(x)$ be such that
$C_n(x) = B_{i_n(x)}(B_{n-1}(x))$. Let $Dir_k(\alpha_{n,1}(x), \ldots, \alpha_{n,k}(x))$ be the prior distribution
of $W_n(x)$ in (1.128), and define $s_n(x) = \sum_{i=1}^{k} \alpha_{n,i}(x)$.

(a) Prove that $\Pr(X = x) = \prod_{n=1}^{\infty} (\alpha_{n,i_n(x)}(x)/s_n(x))$.

(b) If $X_1, X_2$ are conditionally IID with distribution $P$ given $\mathbf{P} = P$, prove
that

[display equation not recoverable from the source text; it is a product over $n = 1, 2, \ldots$]

where $\delta_{n,i}$ is defined in (1.129).

(c) If $\inf_{x,n} s_n(x) > 0$ and $\sup_{x,n,i} \alpha_{n,i}(x)/s_n(x) < 1$, then $\mathbf{P}$ is continuous
with probability 1.

(Hint: Use Problem 49 on page 80.)

58. Consider a Dirichlet process $Dir(\alpha)$ as a Polya tree with $k = 2$ and
$\alpha_{n,i}(x) = c_n$ for all $n$, $i$, and $x$.

(a) If $a = \alpha(\mathcal{X})$, show that $c_n = a/2^n$ for all $n$. (Hint: Use the result
of Problem 52 above, which implies that the product of a $Beta(b, c)$
times an independent $Beta(b + c, d)$ is $Beta(b, c + d)$.)

(b) Show that a condition for $\mathbf{P}$ to be continuous in Problem 57 above is
violated.



Chapter 2 
Sufficient Statistics 



We now turn our attention to the broad area of statistics. This will concern 
the manner in which one learns from data. In this chapter, we will study 
some of the basic properties of probability models and how data can be 
used to help us learn about the parameters of those models. 

2.1 Definitions 

2.1.1 Notational Overview 

We assume that there is a probability space $(S, \mathcal{A}, \mu)$ underlying all probability
calculations. It will be common to refer to probabilities calculated
under $\mu$ using the symbol $\Pr(\cdot)$. Conditional probabilities will be denoted
$\Pr(\cdot|\cdot)$. We also assume that there is a random quantity $X : S \to \mathcal{X}$, where
$\mathcal{X}$ is called the sample space (with $\sigma$-field $\mathcal{B}$), which will usually be some
subset of a Euclidean space, but will be a Borel space in any event. We
will often refer to $X$ as the data. Let $\mathcal{A}_X$ stand for the sub-$\sigma$-field of $\mathcal{A}$
generated by $X$ (that is, $\mathcal{A}_X = X^{-1}(\mathcal{B})$). Since $(\mathcal{X}, \mathcal{B})$ is a Borel space, $\mathcal{B}$
contains all singletons. This will allow us to claim that random quantities
are functions of $X$ if and only if they are measurable with respect to $\mathcal{A}_X$
by Theorem A.42. Generic elements of $\mathcal{X}$ will usually be denoted by $x$, $y$,
or $z$, or $x_1, x_2, \ldots$, depending on how many we need at once.

Assume that there is a parametric family $\mathcal{P}_0$ of distributions for $X$, and
let the parameter be $\Theta : S \to \Omega$, where $\Omega$ is a parameter space with $\sigma$-field
$\tau$. Usually, $\Omega$ will be a subset of some finite-dimensional Euclidean space,
but not always. $X$ will usually be a vector of exchangeable coordinates, but
this is not required. When the coordinates of $X$ are exchangeable, then the
elements of $\mathcal{P}_0$ will usually be distributions that say that the coordinates
of $X$ are IID each with distribution $P_\theta$ for some $\theta \in \Omega$. As mentioned
in Section 1.5.5, we will use the symbol $P_\theta$ to stand for the conditional
distribution of $X$ given $\Theta = \theta$ (a distribution on $(\mathcal{X}, \mathcal{B})$) as well as the
conditional distribution of each coordinate of $X$ in the case in which $X$ has
exchangeable coordinates. We use the symbol $P'_\theta(\cdot)$ to stand for $\Pr(\cdot|\Theta = \theta)$,
a probability on $(S, \mathcal{A}_X)$. That is, for $A \in \mathcal{A}_X$ with $A = X^{-1}(B)$ for some
$B \in \mathcal{B}$,

$$P'_\theta(A) = P'_\theta(X \in B) = \Pr(X \in B|\Theta = \theta) = P_\theta(B).$$

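Example 2.1. Let $\{X_n\}_{n=1}^{\infty}$ be conditionally IID random variables with $N(\theta, 1)$
distribution given $\Theta = \theta$. Let $X = (X_1, \ldots, X_n)$. If $B \in \mathcal{B}^1$, the one-dimensional
Borel $\sigma$-field, then

$$P_\theta(B) = \int_B \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - \theta)^2}{2}\right) dx = P'_\theta(X_i \in B) = \Pr(X_i \in B|\Theta = \theta).$$

Similarly, if $C \in \mathcal{B}^n$, the $n$-dimensional Borel $\sigma$-field, then

$$P_\theta(C) = \int_C (2\pi)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2\right) dx_1 \cdots dx_n = P'_\theta(X \in C) = \Pr(X \in C|\Theta = \theta).$$

Let $\mu_\Theta$ be the distribution of $\Theta$, and let $D \in \tau$ and $B \in \mathcal{B}$. Let
$A = X^{-1}(B)$ and $E = \Theta^{-1}(D)$. Then we have

$$\Pr(X \in B, \Theta \in D) = \mu(A \cap E) = \int_E \Pr(A|\Theta)(s)\, d\mu(s) = \int_D P_\theta(B)\, d\mu_\Theta(\theta),$$

where the last equality follows from Theorem A.81.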



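Example 2.2 (Continuation of Example 2.1). Suppose that $\Theta \sim N(0, 1)$. Then,
for each $D \in \tau$ and $B \in \mathcal{B}$, $\Pr(\Theta \in D, X \in B)$ equals

$$\int_D \int_B (2\pi)^{-(n+1)/2} \exp\left(-\frac{1}{2}\left[\theta^2 + \sum_{i=1}^{n} (x_i - \theta)^2\right]\right) dx_1 \cdots dx_n\, d\theta.$$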


2.1.2 Sufficiency 

A statistic is virtually any measurable function of the data, X. 

Definition 2.3. Let $(\mathcal{T}, \mathcal{C})$ be a measurable space such that $\mathcal{C}$ contains all
singletons. If $T : \mathcal{X} \to \mathcal{T}$ is measurable, then $T$ is called a statistic.

It appears that almost anything can be a statistic. The only requirement
is that it be a measurable function of $X$ to a space in which singletons
are measurable sets, such as a Borel space. It will prove convenient, when
$T : \mathcal{X} \to \mathcal{T}$, to refer to $T$ as a random quantity. When we do this, we will
mean the random quantity $T(X) : S \to \mathcal{T}$. When we need to refer to the
specific value that $T$ assumes when $X = x$, we write $T(x)$. We will let $P_{\theta,T}$
stand for the probability measure induced on the space $(\mathcal{T}, \mathcal{C})$ from $P_\theta$ by
the function $T$. In this way $P'_\theta(T(X) \in C) = P_{\theta,T}(C)$. We will often write
$P'_\theta(T \in C)$ to stand for this quantity as well.

There is a special class of statistics that are very useful in statistical 
inference. These are statistics that provide a summary of the data sufficient 
for performing all inferences of interest. 

Definition 2.4. Let $\mathcal{P}_0$ be a parametric family of distributions on $(\mathcal{X}, \mathcal{B})$.
Let $(\Omega, \tau)$ be a parameter space and $\Theta : \mathcal{P}_0 \to \Omega$ be the parameter. Let
$T : \mathcal{X} \to \mathcal{T}$ be a statistic. We say that $T$ is a sufficient statistic for $\Theta$ (in the
Bayesian sense) if, for every prior $\mu_\Theta$, there exist versions of the posteriors
$\mu_{\Theta|X}$ and $\mu_{\Theta|T}$ such that, for every $B \in \tau$, $\mu_{\Theta|X}(B|x) = \mu_{\Theta|T}(B|T(x))$, a.s.
$[\mu_X]$, where $\mu_X$ is the marginal distribution of $X$.

It appears that once one has settled on a parametric family of distributions
for the data $X$, one need only calculate a sufficient statistic, because the
posterior distribution of $\Theta$ given the sufficient statistic is the same as given
$X$, no matter which prior distribution one uses. So long as one sticks with
the chosen parametric family, the sufficient statistic is sufficient for making
inference about $\Theta$, and thereby about future observations (conditionally
independent of $X$ given $\Theta$) through (1.37) on page 18.¹

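Example 2.5. Let $\{X_n\}_{n=1}^\infty$ be exchangeable Bernoulli random variables, and
let $\mathcal{P}_0$ be the set of all IID distributions (the largest parametric family available).
Let $X = (X_1, \ldots, X_n)$, and let $P_\theta$ be the distribution that says the coordinates of
X are IID $Ber(\theta)$ random variables. We have already seen (Theorem 1.56) that
if the prior is $\mu_\Theta$, then the posterior for $\Theta$ has Radon-Nikodym derivative

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|x) = \frac{\theta^{\sum_{i=1}^n x_i}(1-\theta)^{n-\sum_{i=1}^n x_i}}{\int \psi^{\sum_{i=1}^n x_i}(1-\psi)^{n-\sum_{i=1}^n x_i}\, d\mu_\Theta(\psi)}.$$

Next, treat $T(X) = \sum_{i=1}^n X_i$ as the data. The density of T given $\Theta = \theta$ (with
respect to counting measure on the nonnegative integers) is
$f_{T|\Theta}(t|\theta) = \binom{n}{t}\theta^t(1-\theta)^{n-t}$ for $t = 0, \ldots, n$. It follows from Bayes' theorem 1.31 that the posterior given
$T = t = \sum_{i=1}^n x_i$ has derivative

$$\frac{d\mu_{\Theta|T}}{d\mu_\Theta}(\theta|t) = \frac{\binom{n}{t}\theta^t(1-\theta)^{n-t}}{\int \binom{n}{t}\psi^t(1-\psi)^{n-t}\, d\mu_\Theta(\psi)}.$$

This is the same as the other posterior, hence T is sufficient according to Definition 2.4.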
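
A small numerical check can make Example 2.5 concrete. The sketch below assumes a hypothetical three-point prior on $\theta$ (the support points and weights are illustrative and do not come from the text) and compares the posterior computed from the full vector x with the posterior computed from $t = \sum x_i$ alone. Exact rational arithmetic is used so that the two can be compared with `==`.

```python
from fractions import Fraction
from math import comb

# Hypothetical discrete prior on theta (illustrative values only).
prior = {Fraction(1, 4): Fraction(1, 2),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(3, 4): Fraction(1, 6)}

def posterior_given_x(x):
    """Posterior over theta from the full Bernoulli vector x."""
    n, s = len(x), sum(x)
    w = {th: p * th**s * (1 - th)**(n - s) for th, p in prior.items()}
    total = sum(w.values())
    return {th: v / total for th, v in w.items()}

def posterior_given_t(t, n):
    """Posterior over theta from T = sum(x), using the Bin(n, theta) density."""
    w = {th: p * comb(n, t) * th**t * (1 - th)**(n - t)
         for th, p in prior.items()}
    total = sum(w.values())
    return {th: v / total for th, v in w.items()}

x = (1, 0, 1, 1, 0)
print(posterior_given_x(x) == posterior_given_t(sum(x), len(x)))
```

The binomial coefficient cancels between numerator and denominator, which is why the two posteriors agree for any prior, not only for the one assumed here.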

1 See Problem 24 on page 141 for an example of observations not conditionally
independent given $\Theta$ for which a sufficient statistic is not sufficient for making
predictive inference.



2.1. Definitions 85 



In Example 2.5, for every prior the posterior distribution of $\Theta$ given
X = x was a function of T(x). The following lemma says that this fact is
enough to conclude that T is sufficient.

Lemma 2.6. 2 Let T be a statistic and let $\mathcal{B}_T$ be the sub-$\sigma$-field of $\mathcal{B}$ generated
by T. Then T is sufficient in the Bayesian sense if and only if, for
every prior distribution $\mu_\Theta$, there exists a version of the posterior distribution
given X, $\mu_{\Theta|X}$, such that for all $B \in \tau$, $\mu_{\Theta|X}(B|\cdot)$ is measurable with
respect to $\mathcal{B}_T$.

Proof. This result is an immediate consequence of Theorem B.73 if we
make the following correspondences:

    Theorem B.73    $\mathcal{B}$            $\mathcal{C}$              $Z$
    Lemma 2.6       $X^{-1}(\mathcal{B})$    $X^{-1}(\mathcal{B}_T)$

□

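Example 2.7. Let $\mathcal{P}_0$ be the set of all IID exponential distributions, and let $P_\theta$
say that $\{X_n\}_{n=1}^\infty$ are IID $Exp(\theta)$ random variables. If $\mu_\Theta$ is the prior distribution
and $X = (X_1, \ldots, X_n)$, then the posterior has density

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|x) = \frac{\theta^n \exp\left(-\theta \sum_{i=1}^n x_i\right)}{\int \psi^n \exp\left(-\psi \sum_{i=1}^n x_i\right) d\mu_\Theta(\psi)}$$

with respect to the prior. Notice that this is a function of $T(x) = \sum_{i=1}^n x_i$; hence
Lemma 2.6 says that T is sufficient.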

There is a more commonly used definition of sufficient statistic, which 
does not refer to prior distributions. Loosely speaking, this definition says 
that T is sufficient if the conditional distribution of X given $\Theta = \theta$ and T
does not depend on $\theta$.

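Definition 2.8. Let $\mathcal{P}_0$ be a parametric family of distributions on $(\mathcal{X}, \mathcal{B})$.
Let $(\Omega, \tau)$ be a parameter space and $\Theta : \mathcal{P}_0 \to \Omega$ be the parameter. Let
$T : \mathcal{X} \to \mathcal{T}$ be a statistic. Suppose that there exist versions of $P_\theta(\cdot|T)$ and
a function $r : \mathcal{B} \times \mathcal{T} \to [0, 1]$ such that $r(\cdot, t)$ is a probability on $(\mathcal{X}, \mathcal{B})$ for
every $t \in \mathcal{T}$, $r(A, \cdot)$ is measurable for every $A \in \mathcal{B}$, and for every $\theta \in \Omega$ and
every $B \in \mathcal{B}$,

$$P_\theta(B|T = t) = r(B, t), \quad \text{a.e. } [P_{\theta,T}].$$

Then we say that T is a sufficient statistic for $\Theta$ (in the classical sense).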

This definition says that a statistic T is sufficient if and only if, after one
observes the value T(X) = t, one can generate data X' with conditional
distribution $r(\cdot, t)$, and then the conditional distribution of X' given $\Theta$ is
the same as the conditional distribution of X given $\Theta$. It will be common
to use the symbol $E(\cdot|T)$ in place of $E_\theta(\cdot|T)$ when T is sufficient. In such a
case, if $E_\theta|g(X)| < \infty$ for all $\theta$, then $E(g(X)|T = t) = \int g(x)\, dr(x, t)$.



2 This lemma is used in the proofs of Lemma 2.15 and Theorem 2.29. 






Example 2.9 (Continuation of Example 2.5; see page 84). The $X_i$ are IID
$Ber(\theta)$ given $\Theta = \theta$, and $X = (X_1, \ldots, X_n)$. Let $T(x) = \sum_{i=1}^n x_i$. We need
to compute $P'_\theta(X = x|T(X) = t)$ for all $\theta$ and all x such that t = T(x). Since
both X and T are discrete random variables,

$$P'_\theta(X = x|T = t) = \frac{\theta^t(1-\theta)^{n-t}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \binom{n}{t}^{-1}.$$

Set $r(\cdot, t)$ to be the distribution that is uniform on the set of all x such that
$\sum_{i=1}^n x_i = t$ (probability $\binom{n}{t}^{-1}$ for each such x). We now see that T is sufficient
according to Definition 2.8.
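
The claim in Example 2.9 can be checked by brute force for small n. The following sketch enumerates every binary vector with a given sum and computes the conditional probabilities directly; the particular values of n, t, and $\theta$ are illustrative choices, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_given_sum(n, t, theta):
    """P(X = x | T = t) for IID Ber(theta) coordinates, by enumeration."""
    joint = {x: theta**sum(x) * (1 - theta)**(n - sum(x))
             for x in product((0, 1), repeat=n) if sum(x) == t}
    total = sum(joint.values())          # this is P(T = t)
    return {x: p / total for x, p in joint.items()}

n, t = 4, 2
for theta in (Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)):
    cond = conditional_given_sum(n, t, theta)
    # Every vector with sum t gets the same mass, whatever theta is.
    assert all(p == Fraction(1, comb(n, t)) for p in cond.values())
print(len(cond), "vectors, each with probability", Fraction(1, comb(n, t)))
```

Because the conditional distribution does not involve $\theta$, the same function r serves for every member of the family, which is exactly what Definition 2.8 asks for.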

Example 2.10 (Continuation of Example 2.7; see page 85). If $X = (X_1, \ldots, X_n)$
with the $X_i$ having $Exp(\theta)$ distribution given $\Theta = \theta$, let $T(x) = \sum_{i=1}^n x_i$. We
need to find the conditional distribution of X given T = t. By Corollary B.55,
the conditional distribution of X given T = t and $\Theta = \theta$ has density

$$\frac{f_{X|\Theta}(x_1, \ldots, x_n|\theta)}{f_{T|\Theta}(t|\theta)} = \frac{\theta^n \exp(-\theta t)}{\theta^n t^{n-1}\exp(-\theta t)/(n-1)!} = \frac{(n-1)!}{t^{n-1}}$$

with respect to a measure $\nu_{\mathcal{X}|\mathcal{T}}(\cdot|t)$, which does not depend on $\theta$. Since this
distribution is the same for all $\theta$, T is sufficient in the classical sense.

In general, sufficient statistics need not be much simpler than the entire 
data set. 

Definition 2.11. Let $X_1, \ldots, X_n$ be random variables. Define $X_{(1)}$ to be
$\min\{X_1, \ldots, X_n\}$, and for $k > 1$, define

$$X_{(k)} = \min\left(\{X_1, \ldots, X_n\} \setminus \{X_{(1)}, \ldots, X_{(k-1)}\}\right).$$

The vector $(X_{(1)}, \ldots, X_{(n)})$ is called the order statistics of $X_1, \ldots, X_n$.

Proposition 2.12. Let $X = (X_1, \ldots, X_n)$, and suppose that $X_1, \ldots, X_n$
are exchangeable random variables. Then the order statistics are sufficient.

Example 2.13. Let $X_1, \ldots, X_n$ be conditionally IID with Cauchy distribution
$Cau(\theta, 1)$ given $\Theta = \theta$. In this case, one can show that all sufficient statistics are
at least as complicated as the order statistics. To do this, however, some theorems
from Section 2.1.3 will prove useful.

In all cases of interest to us, the two definitions of sufficient statistic are 
equivalent. 

Theorem 2.14. 3 Let $(\mathcal{T}, \mathcal{C})$ be a Borel space, and let $T : \mathcal{X} \to \mathcal{T}$ be a
statistic. The following are both true:



3 The proof of part 1 is reminiscent of the proof of Theorem 1 of Halmos and 
Savage (1949). The proof of part 2 is due to Blackwell and Ramamoorthi (1982). 






1. If there is a $\sigma$-finite measure $\nu$ such that $P_\theta \ll \nu$ for all $\theta$, and T is
sufficient in the Bayesian sense, then T is sufficient in the classical
sense.

2. If T is sufficient in the classical sense, then T is sufficient in the
Bayesian sense.

The proof of part 1 of Theorem 2.14 requires a lemma that will also be 
used in the proof of Theorem 2.21. 

Lemma 2.15. Let $\nu$ be a $\sigma$-finite measure such that $P_\theta \ll \nu$ for all $\theta$. If T
is sufficient in the Bayesian sense, then there exists a probability measure
$\nu^*$ such that $P_\theta \ll \nu^*$ for all $\theta$ and $dP_\theta/d\nu^*(x)$ is a function $h(\theta, T(x))$.
Also, $\nu^* \ll \nu$.

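Proof. Let $dP_\theta/d\nu(x) = f_{X|\Theta}(x|\theta)$. Since each $P_\theta \ll \nu$, Theorem A.78
says that there exist countable sequences $\{\theta_i\}_{i=1}^\infty$ and $\{c_i\}_{i=1}^\infty$ such that
$c_i \ge 0$, $\sum_{i=1}^\infty c_i = 1$, and $P_\theta \ll \nu^*$ for every $\theta \in \Omega$, where $\nu^* = \sum_{i=1}^\infty c_i P_{\theta_i}$.

Note that $\nu^* \ll \nu$. For $\theta \in \Omega$ such that $\theta$ is not one of the $\theta_i$, specify the
following prior distribution over $\Omega$: $\Pr(\Theta = \theta) = 1/2$, and $\Pr(\Theta = \theta_i) = c_i/2$,
for $i = 1, 2, \ldots$. The posterior probability of $\Theta = \theta$ given X = x is

$$\Pr(\Theta = \theta|X = x) = \frac{f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\theta) + \sum_{i=1}^\infty c_i f_{X|\Theta}(x|\theta_i)} = \left(1 + \frac{\sum_{i=1}^\infty c_i f_{X|\Theta}(x|\theta_i)}{f_{X|\Theta}(x|\theta)}\right)^{-1}.$$

According to Lemma 2.6, for each $\theta$, this is a function of T(x). That is,
there exists a function h such that, for each $\theta$,

$$\frac{f_{X|\Theta}(x|\theta)}{\sum_{i=1}^\infty c_i f_{X|\Theta}(x|\theta_i)} = h(\theta, T(x)). \tag{2.16}$$

By the chain rule A.79, it can be seen that the left-hand side of (2.16) is
equal to $dP_\theta/d\nu^*(x)$.

For $\theta \in \{\theta_i\}_{i=1}^\infty$, replace the prior above by one that has $\Pr(\Theta = \theta_i) = c_i$
for all i. We still have that (2.16) is $dP_\theta/d\nu^*(x)$. Also, Lemma 2.6 says that
$\Pr(\Theta = \theta_j|X = x)$ is still a function of T(x) for all j. But $\Pr(\Theta = \theta_j|X = x)$
is just $c_j$ times the left-hand side of (2.16). □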
Proof of Theorem 2.14. For part 1, define the function r to be the
conditional probability function on $\mathcal{X}$ given T = t calculated from the
probability $\nu^*$ in Lemma 2.15. That is, for every $C \in \mathcal{C}$ and every $B \in \mathcal{B}$,

$$\nu^*(T^{-1}(C) \cap B) = \int_C r(B, t)\, d\nu^*_T(t),$$

where $\nu^*_T$ is the probability on $(\mathcal{T}, \mathcal{C})$ induced by T from $\nu^*$. It is easy to
see that this implies that for every integrable $g : \mathcal{T} \to \mathbb{R}$ and $B \in \mathcal{B}$,

$$\int g(T(x)) I_B(x)\, d\nu^*(x) = \int g(t) r(B, t)\, d\nu^*_T(t). \tag{2.17}$$

We now wish to show that this function r can serve as the conditional
distribution of X given T = t and $\Theta = \theta$ for all $\theta$. To see that this is
true, note that for all $B \in \mathcal{B}$, $P_\theta(B|T = t)$ is any function $m : \mathcal{T} \to [0, 1]$
satisfying

$$P'_\theta(X \in B, T(X) \in C) = \int_C m(t)\, dP_{\theta,T}(t), \quad \text{for all } C \in \mathcal{C}, \tag{2.18}$$

where $P_{\theta,T}$ is the probability on $(\mathcal{T}, \mathcal{C})$ induced by T from $P_\theta$. According
to Lemma 2.15, we have that

$$\frac{dP_{\theta,T}}{d\nu^*_T}(t) = h(\theta, t). \tag{2.19}$$

The left-hand side of (2.18) can be written as

$$\int I_B(x) I_C(T(x)) h(\theta, T(x))\, d\nu^*(x) = \int I_C(t) r(B, t) h(\theta, t)\, d\nu^*_T(t) = \int_C r(B, t)\, dP_{\theta,T}(t),$$

where the first equality follows from (2.17) and the second follows from
(2.19). It follows that r(B, t) can play the role of m(t) in (2.18), and the
proof of part 1 is complete.

To prove part 2, let r be as in Definition 2.8, and let $\mu_\Theta$ be a prior for
$\Theta$. By the law of total probability B.70 (conditional on T), the conditional
distribution of X given T = t, $\mu_{X|T}(\cdot|t)$, is given for every $B \in \mathcal{B}$ by

$$\mu_{X|T}(B|T = t) = \int P_\theta(B|T = t)\, d\mu_{\Theta|T}(\theta|t) = \int r(B, t)\, d\mu_{\Theta|T}(\theta|t) = r(B, t),$$

where $\mu_{\Theta|T}$ is the posterior distribution of $\Theta$ given T. Hence, we have
that the conditional distribution of X given T and $\Theta$ is the conditional
distribution of X given T. According to Theorem B.64, this means that X
and $\Theta$ are conditionally independent given T. According to Theorem B.61,
we have that the posterior given T and X is the same as the posterior given
T. Corollary B.74 says that the posterior given T and X is the same as the
posterior given X.

Blackwell and Ramamoorthi (1982) give an example in which the extra 
condition in part 1 of Theorem 2.14 fails and there exists a statistic suffi- 
cient in the Bayesian sense but not sufficient in the classical sense. There 
is, however, the following result. 






Theorem 2.20. If T is sufficient in the Bayesian sense, then for every
prior distribution $\mu_\Theta$, there exists a version of the conditional distribution
of X given T, $\mu_{X|T}$, such that for every $B \in \mathcal{B}$,

$$\mu_\Theta\left(\{\theta : P_\theta(B|T = t) = \mu_{X|T}(B|t), \text{ a.e. } [\mu_T]\}\right) = 1,$$

where $\mu_T$ is the marginal distribution of T.

Proof. Let $\mu_\Theta$ be a prior distribution for $\Theta$. Let $\mu_{X|T}(B|t)$ be the conditional
probability of $\{X \in B\}$ given T = t calculated from the marginal
distribution of X and T (not conditional on $\Theta$). Since T is sufficient in
the Bayesian sense, the conditional distribution of $\Theta$ given T is the same
as the conditional distribution of $\Theta$ given X (and T). This means that $\Theta$
is independent of X given T according to Theorem B.64. Theorem B.61
then says that the conditional distribution of X given $\Theta$ and T is the same
as the conditional distribution given T, which means that for all $B \in \mathcal{B}$,
$P_\theta(B|T = t) = \mu_{X|T}(B|t)$, a.s. with respect to the joint distribution of $\Theta$
and T. The result now follows. □

Theorem 2.20 says that every prior distribution assigns probability 1 to a
subset of the parameter space which, if it were the entire parameter space,
would allow us to conclude that sufficiency in the Bayesian sense implied
sufficiency in the classical sense without the added condition of absolute
continuity.

There is an easier way to characterize sufficiency in the case in which all
conditional distributions given $\Theta$ are absolutely continuous with respect to
a single $\sigma$-finite measure.

Theorem 2.21 (Fisher-Neyman factorization theorem). 4 Assume
that $\{P_\theta : \theta \in \Omega\}$ is a parametric family such that $P_\theta \ll \nu$ ($\sigma$-finite) for all
$\theta$ and $dP_\theta/d\nu(x) = f_{X|\Theta}(x|\theta)$. Then T(X) is sufficient for $\Theta$ if and only if
there are functions $m_1$ and $m_2$ such that

$$f_{X|\Theta}(x|\theta) = m_1(x) m_2(T(x), \theta), \quad \text{for all } \theta.$$

Proof. First, we do the "if" part. Let $f_{X|\Theta}(x|\theta) = m_1(x) m_2(T(x), \theta)$ for
all $\theta$, and let $\mu_\Theta$ be an arbitrary prior for $\Theta$. Bayes' theorem 1.31 says that
the posterior distribution of $\Theta$ given X = x is absolutely continuous with
respect to the prior, and the Radon-Nikodym derivative is

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|x) = \frac{m_1(x) m_2(T(x), \theta)}{\int_\Omega m_1(x) m_2(T(x), \psi)\, d\mu_\Theta(\psi)} = \frac{m_2(T(x), \theta)}{\int_\Omega m_2(T(x), \psi)\, d\mu_\Theta(\psi)},$$

which is a function of T(x). It follows from Lemma 2.6 that T(X) is sufficient.



4 Versions of this theorem originated with Fisher (1922, 1925) and Neyman
(1935).






For the "only if" part, assume that T(X) is sufficient. According to
Lemma 2.15, there is a measure $\nu^*$ such that $P_\theta \ll \nu^*$ for all $\theta$, $dP_\theta/d\nu^*(x)$
is a function h of $\theta$ and T(x), and $\nu^* \ll \nu$. It follows that

$$f_{X|\Theta}(x|\theta) = h(\theta, T(x)) \frac{d\nu^*}{d\nu}(x).$$

If we set $m_1(x)$ equal to the second factor on the right and $m_2(T(x), \theta)$
equal to the first factor on the right, we are done. □

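Example 2.22. Let $P_\theta$ say that $\{X_n\}_{n=1}^\infty$ are IID $U(0, \theta)$, $\Omega = (0, \infty)$, and let
$X = (X_1, \ldots, X_n)$. Then

$$f_{X|\Theta}(x|\theta) = \frac{1}{\theta^n} I_{[0,\infty)}\left(\min_i x_i\right) I_{[0,\theta]}\left(\max_i x_i\right).$$

By Theorem 2.21, $T(X) = \max_i X_i$ is sufficient. If $\mu_\Theta$ is a prior, then the posterior
has derivative $(d\mu_{\Theta|T}/d\mu_\Theta)(\theta|t) \propto I_{[t,\infty)}(\theta)/\theta^n$ as a function of $\theta$.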

The case not covered by Theorem 2.21, in which not all $P_\theta$ are absolutely
continuous with respect to the same $\sigma$-finite measure, is more complicated.
One case in which the conclusion to Theorem 2.21 still applies is that of
discrete random variables.

Proposition 2.23. If $\{P_\theta : \theta \in \Omega\}$ is a parametric family such that each
$P_\theta$ is a discrete distribution, then T(X) is sufficient in the classical sense
for $\Theta$ if and only if there are functions $m_1$ and $m_2$ such that

$$\Pr(X = x|\Theta = \theta) = m_1(x) m_2(T(x), \theta), \quad \text{for all } \theta.$$

This proposition is needed only to handle cases in which $C = \bigcup_{\theta \in \Omega}\{x : P_\theta(x) > 0\}$
is an uncountable set; otherwise all $P_\theta$ are absolutely continuous
with respect to counting measure on C.

The following lemma tells us that when a statistic T is sufficient and the
distributions of X are all dominated by a common $\sigma$-finite measure, then
we can replace X by T and the distributions of T are still dominated by a
$\sigma$-finite measure. In fact, we can give a formula for the density of T.

Lemma 2.24. 5 Assume the conditions of Theorem 2.21, and assume that
$T : \mathcal{X} \to \mathcal{T}$ is sufficient. Then there exists a measure $\nu_T$ on $(\mathcal{T}, \mathcal{C})$ such
that $P_{\theta,T} \ll \nu_T$ and $dP_{\theta,T}/d\nu_T(t) = m_2(t, \theta)$.

Proof. Apply Theorem A.78 to find a probability $\nu^* = \sum_{i=1}^\infty c_i P_{\theta_i}$ such
that $P_\theta \ll \nu^*$ for all $\theta$. Then

$$\frac{dP_\theta}{d\nu^*}(x) = \frac{f_{X|\Theta}(x|\theta)}{\sum_{i=1}^\infty c_i f_{X|\Theta}(x|\theta_i)} = \frac{m_2(T(x), \theta)}{\sum_{i=1}^\infty c_i m_2(T(x), \theta_i)}.$$



5 This lemma is used in the proof of Lemma 2.58. 






Since this density is a function of T(x), we can write

$$P_{\theta,T}(B) = \int_{T^{-1}(B)} \frac{dP_\theta}{d\nu^*}(x)\, d\nu^*(x) = \int_{T^{-1}(B)} \frac{m_2(T(x), \theta)}{\sum_{i=1}^\infty c_i m_2(T(x), \theta_i)}\, d\nu^*(x) = \int_B \frac{m_2(t, \theta)}{\sum_{i=1}^\infty c_i m_2(t, \theta_i)}\, d\nu^*_T(t),$$

where $\nu^*_T$ is the measure on $(\mathcal{T}, \mathcal{C})$ induced by T from $\nu^*$. Define $\nu_T$ by
$d\nu_T/d\nu^*_T(t) = \left[\sum_{i=1}^\infty c_i m_2(t, \theta_i)\right]^{-1}$ to complete the proof. □

In many of the examples that we have considered and will consider, there
exists a sufficient statistic whose dimension is the same for all sample sizes.
In such cases, there might exist a particularly convenient family of prior
distributions available for the parameter.

Theorem 2.25. Suppose that there exists a sufficient statistic of fixed dimension
k for all sample sizes. That is, suppose that there exist functions
$T_n$ (with image $\mathcal{T} \subseteq \mathbb{R}^k$ for all n), $m_{1,n}$, and $m_{2,n}$ such that

$$f_{X_1,\ldots,X_n|\Theta}(x_1, \ldots, x_n|\theta) = m_{1,n}(x_1, \ldots, x_n)\, m_{2,n}(T_n(x_1, \ldots, x_n), \theta).$$

Suppose also, that for all n and all $t \in \mathcal{T}$,

$$0 < c(t, n) = \int m_{2,n}(t, \theta)\, d\lambda(\theta) < \infty,$$

for some measure $\lambda$. Then the family of densities with respect to $\lambda$

$$\mathcal{P} = \left\{\frac{m_{2,n}(t, \cdot)}{c(t, n)} : t \in \mathcal{T},\ n = 1, 2, \ldots\right\}$$

forms a conjugate family in the sense that the posterior density with respect
to $\lambda$ is a member of this class if the prior is a member of the class.

Proof. 6 Let $f_\Theta(\theta) = m_{2,\ell}(t, \theta)/c(t, \ell)$ for some $t \in \mathcal{T}$ and some $\ell$. Let
$(y_1, \ldots, y_\ell)$ be such that $T_\ell(y_1, \ldots, y_\ell) = t$. Suppose that the data are
$X_1 = x_1, \ldots, X_n = x_n$ for some n. We note that

$$\begin{aligned}
f_{X_1,\ldots,X_{\ell+n}|\Theta}(x_1, \ldots, x_n, y_1, \ldots, y_\ell|\theta)
&= m_{1,n}(x_1, \ldots, x_n)\, m_{2,n}(T_n(x_1, \ldots, x_n), \theta) \\
&\quad \times m_{1,\ell}(y_1, \ldots, y_\ell)\, m_{2,\ell}(t, \theta) \\
&= m_{1,n+\ell}(x_1, \ldots, x_n, y_1, \ldots, y_\ell)\, m_{2,n+\ell}(t', \theta),
\end{aligned} \tag{2.26}$$

where $t' = T_{n+\ell}(x_1, \ldots, x_n, y_1, \ldots, y_\ell)$. The posterior density of $\Theta$ with
respect to the measure $\lambda$ would be

$$f_{\Theta|X_1,\ldots,X_n}(\theta|x_1, \ldots, x_n) = \frac{m_{2,\ell}(t, \theta)\, m_{1,n}(x_1, \ldots, x_n)\, m_{2,n}(T_n(x_1, \ldots, x_n), \theta)}{\int_\Omega m_{2,\ell}(t, \psi)\, m_{1,n}(x_1, \ldots, x_n)\, m_{2,n}(T_n(x_1, \ldots, x_n), \psi)\, d\lambda(\psi)} = \frac{m_{2,n+\ell}(t', \theta)}{c(t', n + \ell)},$$

by (2.26). □

The family of prior densities $\mathcal{P}$ and their corresponding distributions is
called a natural conjugate family of priors.

6 This proof follows the presentation of Section 9.3 of DeGroot (1970).

Example 2.27. Let $\{X_n\}_{n=1}^\infty$ be a sequence of conditionally IID $Ber(\theta)$ random
variables given $\Theta = \theta$. Let $T_n = \sum_{i=1}^n X_i$. Then $m_{2,n}(t, \theta) = \theta^t(1 - \theta)^{n-t}$ and
$c(t, n) = t!(n - t)!/(n + 1)!$. The family of natural conjugate priors is a subset of
the family of Beta distributions. In particular, $m_{2,n}(t, \theta)/c(t, n)$ is the $Beta(t + 1, n - t + 1)$
density as a function of $\theta$. Actually, the entire collection of Beta
distributions has the property that if the prior is in the Beta family, then the
posterior is as well. Theorem 2.25 only tells us that Beta distributions with integer
parameters are natural conjugate.
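
The bookkeeping in Example 2.27 can be sketched in a few lines. The code below assumes that $\lambda$ is Lebesgue measure on [0, 1], checks the stated normalizing constant c(t, n) against a midpoint-rule integral, and shows how a natural conjugate prior indexed by $(t, \ell)$ is updated by Bernoulli data. The function names and the sample data are hypothetical illustrations.

```python
from fractions import Fraction
from math import factorial

def c(t, n):
    """Normalizing constant c(t, n) = t!(n-t)!/(n+1)! from Example 2.27."""
    return Fraction(factorial(t) * factorial(n - t), factorial(n + 1))

def c_numeric(t, n, steps=20000):
    """Midpoint-rule approximation of the integral of
    theta^t (1-theta)^(n-t) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(((k + 0.5) * h)**t * (1 - (k + 0.5) * h)**(n - t)
                   for k in range(steps))

def update(prior_index, data):
    """Map a prior indexed by (t, l) and Bernoulli data of length n to the
    posterior index (t', n + l), where t' adds the observed successes."""
    t, l = prior_index
    return (t + sum(data), l + len(data))

def beta_parameters(index):
    """Beta(t + 1, n - t + 1) parameters of the density m_2(t, .)/c(t, n)."""
    t, n = index
    return (t + 1, n - t + 1)

post = update((2, 5), [1, 1, 0, 1])
print(post, beta_parameters(post))
print(float(c(2, 5)), c_numeric(2, 5))
```

Only integer Beta parameters can arise this way, which matches the closing remark of the example that Theorem 2.25 identifies just a subset of the Beta family.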



2.1.3 Minimal and Complete Sufficiency 

The entire data set is always sufficient, so there is not always a savings in 
using a sufficient statistic. However, there are often times when a simpler 
statistic than the entire data set is also sufficient. There is a sense in which 
a sufficient statistic can be as simple as possible. 

Definition 2.28. A sufficient statistic $T : \mathcal{X} \to \mathcal{T}$ is called minimal sufficient
if for every sufficient statistic $U : \mathcal{X} \to \mathcal{U}$, there is a measurable
function $g : \mathcal{U} \to \mathcal{T}$ such that T = g(U), a.s. $[P_\theta]$ for all $\theta$.

Clearly, a bimeasurable function of a minimal sufficient statistic is also 
minimal sufficient. The following theorem says that the mapping from data 
values to the likelihood function is minimal sufficient. 

Theorem 2.29. Suppose that there exist versions of $f_{X|\Theta}(\cdot|\theta)$ for every $\theta$
and a measurable function 7 $T : \mathcal{X} \to \mathcal{T}$ such that T(y) = T(x) if and only
if $y \in \mathcal{D}(x)$, where $\mathcal{D}(x)$ is the set

$$\{y \in \mathcal{X} : f_{X|\Theta}(y|\theta) = f_{X|\Theta}(x|\theta) h(x, y),\ \forall \theta \text{ and some } h(x, y) > 0\},$$

then T(X) is a minimal sufficient statistic.

Proof. First, we show that the distinct sets $\mathcal{D}(x)$ form a partition. If
$y \in \mathcal{D}(x)$, then we can set $h(y, x) = 1/h(x, y)$ and we get that $f_{X|\Theta}(x|\theta) = f_{X|\Theta}(y|\theta) h(y, x)$
for all $\theta$. So $x \in \mathcal{D}(y)$. With h(x, x) = 1 for all x, we see
that $x \in \mathcal{D}(x)$, so the distinct $\mathcal{D}(x)$ form a partition.



7 In most examples it is relatively easy to construct the function T, but you 
can actually prove that such a function exists in general. See Problem 15 on 
page 139. 






Next, we show that T(X) is sufficient. If $\mu_\Theta$ is an arbitrary prior, the
posterior after learning X = x is absolutely continuous with respect to the
prior with Radon-Nikodym derivative

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)}{\int f_{X|\Theta}(x|\psi)\, d\mu_\Theta(\psi)},$$

according to Bayes' theorem 1.31. If $y \in \mathcal{D}(x)$, the posterior after learning
X = y has Radon-Nikodym derivative

$$\frac{d\mu_{\Theta|X}}{d\mu_\Theta}(\theta|y) = \frac{h(x, y) f_{X|\Theta}(x|\theta)}{\int h(x, y) f_{X|\Theta}(x|\psi)\, d\mu_\Theta(\psi)},$$

which is the same as the posterior after learning X = x. Hence, the posterior
is a function of T(x). Lemma 2.6 says that T(X) is sufficient.

Finally, we prove that T(X) is minimal. Let U(X) be another sufficient
statistic. Use the Fisher-Neyman factorization theorem 2.21 to write

$$f_{X|\Theta}(x|\theta) = m_1(x) m_2(U(x), \theta). \tag{2.30}$$

Since $P_\theta(\{x : m_1(x) = 0\}) = 0$ for all $\theta$, we can safely assume that $m_1(x) > 0$
for all x. We need to show that if U(x) = U(y) for some $x, y \in \mathcal{X}$, then
$y \in \mathcal{D}(x)$. It would then follow that T(y) = T(x), and this would make T
a function of U. Suppose that U(x) = U(y). Use (2.30) to write

$$f_{X|\Theta}(y|\theta) = m_1(y) m_2(U(y), \theta) = m_1(y) m_2(U(x), \theta) = \frac{m_1(y)}{m_1(x)} f_{X|\Theta}(x|\theta),$$

for all $\theta$. With $h(x, y) = m_1(y)/m_1(x)$, we see that $y \in \mathcal{D}(x)$. □

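Example 2.31. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $Ber(\theta)$ random
variables. Let $X = (X_1, \ldots, X_n)$. Then

$$f_{X|\Theta}(x|\theta) = \theta^{\sum_{i=1}^n x_i}(1 - \theta)^{n - \sum_{i=1}^n x_i}$$

for all $x_i \in \{0, 1\}$. So the ratio

$$\frac{f_{X|\Theta}(y|\theta)}{f_{X|\Theta}(x|\theta)} = \left(\frac{\theta}{1 - \theta}\right)^{\sum_{i=1}^n y_i - \sum_{i=1}^n x_i}$$

is the same for all $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$, in which case h(x, y) = 1
and $\mathcal{D}(x) = \{y : \sum_{i=1}^n y_i = \sum_{i=1}^n x_i\}$. Then $T(X) = \sum_{i=1}^n X_i$ is the minimal
sufficient statistic.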
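
The partition in Example 2.31 can be explored numerically. The sketch below groups binary vectors by whether their likelihood ratio is constant over a grid of $\theta$ values and confirms that the grouping coincides with equality of the sums. The grid is an illustrative assumption: constancy on a finite grid is only a necessary condition for constancy over all $\theta$, although here it happens to separate the classes.

```python
from fractions import Fraction
from itertools import product

def likelihood(x, theta):
    """Joint Bernoulli density of the vector x given theta."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def same_class(x, y, thetas):
    """True if f(y|theta)/f(x|theta) takes one value across the grid."""
    ratios = {likelihood(y, th) / likelihood(x, th) for th in thetas}
    return len(ratios) == 1

thetas = [Fraction(k, 10) for k in range(1, 10)]   # illustrative grid in (0, 1)
points = list(product((0, 1), repeat=3))
agree = all(same_class(x, y, thetas) == (sum(x) == sum(y))
            for x in points for y in points)
print(agree)
```

Each class found this way is one of the sets $\mathcal{D}(x)$ of Theorem 2.29, labeled by the value of the sum.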

Example 2.32. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $U(0, \theta)$ random
variables. Let $X = (X_1, \ldots, X_n)$ and suppose that the sample space is $\mathbb{R}^{+n}$.
Then $f_{X|\Theta}(x|\theta) = \theta^{-n} I_{[0,\theta]}(\max x_i)$. Now suppose that

$$\theta^{-n} I_{[0,\theta]}(\max x_i) = h(x, y)\, \theta^{-n} I_{[0,\theta]}(\max y_i),$$

for all $\theta$. This is true if and only if

$$I_{[0,\theta]}(\max x_i) = h(x, y) I_{[0,\theta]}(\max y_i),$$

for all $\theta$, which in turn is true if and only if $\max x_i = \max y_i$, in which case
h(x, y) = 1 and $\mathcal{D}(x) = \{y : \max y_i = \max x_i\}$. Then $T(X) = \max X_i$ is the
minimal sufficient statistic.

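Example 2.33. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID with density

$$f_{X_1|\Theta}(y|\theta) = \frac{\theta^y}{y! \sum_{t=0}^{\theta} \theta^t/t!}, \quad \text{for } y = 0, \ldots, \theta,$$

with respect to counting measure on the positive integers. Here $\Omega$ is the set of
positive integers. If $X = (X_1, \ldots, X_n)$, then

$$f_{X|\Theta}(x|\theta) = \frac{\theta^{x_\cdot}}{\prod_{i=1}^n (x_i!) \left(\sum_{t=0}^{\theta} \theta^t/t!\right)^n}, \quad \text{if } \max x_i \le \theta,$$

where $x_\cdot = \sum_{i=1}^n x_i$. Set $h(x, y) = \prod_{i=1}^n (x_i!) / \prod_{i=1}^n (y_i!)$ for all x and y. It follows
that $f_{X|\Theta}(y|\theta) = h(x, y) f_{X|\Theta}(x|\theta)$ for all $\theta$ if and only if $x_\cdot = \sum_{i=1}^n y_i$ and
$\max x_i = \max y_i$. Set $T(x) = (\sum_{i=1}^n x_i, \max x_i)$ and note that $x \in \mathcal{D}(y)$ if and
only if T(x) = T(y). So T(X) is minimal sufficient.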

There are some cases in which we need a sufficient statistic to satisfy an 
additional property. These situations are difficult to describe at the present 
time, but they arise in several places later in this text (in Chapters 3, 4 
and 5, in particular). 

Definition 2.34. A statistic T is complete if for every measurable, real-valued
function g, $E_\theta(g(T)) = 0$ for all $\theta \in \Omega$ implies g(T) = 0, a.s. $[P_\theta]$ for
all $\theta$.

A statistic T is boundedly complete if for every bounded, measurable,
real-valued function g, $E_\theta(g(T)) = 0$ for all $\theta \in \Omega$ implies g(T) = 0, a.s.
$[P_\theta]$ for all $\theta$.

Example 2.35. Suppose that $P_\theta$ says that $T \sim Poi(\theta)$. Suppose also that

$$E_\theta(g(T)) = \sum_{t=0}^\infty g(t) \frac{\theta^t \exp(-\theta)}{t!} = 0$$

for all $\theta$. Then $\sum_{t=0}^\infty g(t)\theta^t/t! = 0$ for all $\theta$. This expression is a power series
representation of the analytic function $h(\theta) = 0$. Since power series for analytic
functions are unique, it must be that g(t)/t! = 0 for all nonnegative integers t.
There are many functions g with this property, such as $g(t) = \sin(2\pi t)$. All such
g satisfy $P_\theta(g(T) = 0) = 1$ for all $\theta$. So T is complete.

Theorem 2.36 (Bahadur's theorem). 8 If U is a boundedly complete
sufficient statistic and finite-dimensional, then it is minimal sufficient.



8 See Bahadur (1957). 






Proof. Let T be another sufficient statistic. We need to show that U is a
function of T. Express $U = (U_1(X), \ldots, U_k(X))$, where k is the dimension
of U. Let $V_i(U) = (1 + \exp(U_i))^{-1}$, so that $V = (V_1, \ldots, V_k)$ is a one-to-one
measurable function of U and each $V_i$ is bounded. Define

$$H_i(t) = E_\theta(V_i(U)|T = t),$$
$$L_i(u) = E_\theta(H_i(T)|U = u).$$

Since U and T are sufficient, these conditional means given $\Theta = \theta$ do not
depend on $\theta$. Since the $V_i$ are bounded, so are the $H_i$ and $L_i$. Note that

$$E_\theta(V_i(U)) = E_\theta(E_\theta(V_i(U)|T)) = E_\theta(H_i(T)) = E_\theta(E_\theta(H_i(T)|U)) = E_\theta(L_i(U)).$$

It follows that $E_\theta(V_i(U) - L_i(U)) = 0$ for all $\theta$. Since U is boundedly complete,
it follows that $P_\theta(V_i(U) = L_i(U)) = 1$ for all $\theta$. So, $E_\theta(L_i(U)|T) = H_i(T)$.
Since $L_i$ and $H_i$ are bounded, they have finite variance, and Proposition B.78
says that

$$\mathrm{Var}_\theta(L_i(U)) = E_\theta(\mathrm{Var}_\theta(L_i(U)|T)) + \mathrm{Var}_\theta(H_i(T)),$$
$$\mathrm{Var}_\theta(H_i(T)) = E_\theta(\mathrm{Var}_\theta(H_i(T)|U)) + \mathrm{Var}_\theta(L_i(U)).$$

It follows easily from these equations that $\mathrm{Var}_\theta(L_i(U)|T) = 0$, a.s. $[P_\theta]$,
hence $\mathrm{Var}_\theta(V_i(U)|T) = 0$, a.s. $[P_\theta]$. So $V_i(U) = E(V_i(U)|T) = H_i(T)$, a.s.
$[P_\theta]$. Since $V_i$ is one-to-one, we get $U_i = V_i^{-1}(H_i(T))$ for each i, and U is a
function of T, as needed. □



2.1.4 Ancillarity 

At the other extreme from sufficiency lie statistics that are independent of 
the parameter. 

Definition 2.37. A statistic U is called ancillary if the conditional distribution
of U given $\Theta = \theta$ is the same for all $\theta$.

Example 2.38. Let $X_1, X_2$ be conditionally independent given $\Theta = \theta$, each with
conditional distribution $N(\theta, 1)$. Let $U = X_2 - X_1$. The conditional distribution of U
given $\Theta = \theta$ is N(0, 2). Since this distribution is the same for all $\theta$, U is ancillary.
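
The reason U is ancillary in Example 2.38 is that $\theta$ cancels in the difference. A short simulation sketch, assuming each observation is written as $\theta$ plus standard normal noise and using an arbitrary fixed seed, shows this: with the same noise draws, two very different values of $\theta$ produce the same values of U up to rounding, and the sample variance is close to 2.

```python
import random

def differences(theta, noise):
    """U = X2 - X1 where X_i = theta + e_i for standard normal e_i."""
    return [(theta + e2) - (theta + e1) for e1, e2 in noise]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

u_a = differences(-3.0, noise)
u_b = differences(10.0, noise)

# The two samples of U coincide up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(u_a, u_b))
mean = sum(u_a) / len(u_a)
var = sum((u - mean)**2 for u in u_a) / (len(u_a) - 1)
print(max_gap < 1e-9, round(mean, 2), round(var, 2))
```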

Sometimes the two extremes meet and a minimal sufficient statistic con- 
tains a coordinate that is ancillary. 

Definition 2.39. If a minimal sufficient statistic is $T = (T_1, T_2)$ and $T_2$ is
ancillary, then $T_1$ is called conditionally sufficient given $T_2$.

Example 2.40. Suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta = \theta$ with
$U(\theta - 1/2, \theta + 1/2)$ distribution. Let $X = (X_1, \ldots, X_n)$. Then

$$f_{X|\Theta}(x|\theta) = I_{[\theta - 1/2, \infty)}(\min x_i)\, I_{(-\infty, \theta + 1/2]}(\max x_i).$$

Let $T_1 = \max X_i$ and $T_2 = \max X_i - \min X_i$. Then $T = (T_1, T_2)$ is minimal
sufficient and $T_2$ is ancillary (see Problem 25 on page 141). So $T_1$ is conditionally
sufficient given $T_2$. In particular, if n = 2, the density of $T_2$ with respect to
Lebesgue measure is $f_{T_2}(t_2) = 2(1 - t_2) I_{[0,1]}(t_2)$.

When a statistic is ancillary, it does not mean that you should ignore it. 
It only means that if you learned nothing but the ancillary, you would not 
change your mind about $\Theta$. You might, however, change your mind about
everything else, including the conditional distribution of other data given
$\Theta$.

Example 2.41 (Continuation of Example 2.40; see page 95). The joint density
of T given $\Theta = \theta$ is

$$f_{T_1,T_2|\Theta}(t_1, t_2|\theta) = n(n - 1)\, t_2^{n-2}\, I_{[0,1]}(t_2)\, I_{[\theta - 1/2 + t_2,\ \theta + 1/2]}(t_1).$$

Since $T_1$ is the maximum of n IID uniform random variables given $\Theta$, it follows
that the distribution of $T_1$ given $\Theta = \theta$ is Beta(n, 1) shifted by $\theta - 1/2$. It is
not hard to show that the conditional distribution of $T_1$ given $(\Theta, T_2) = (\theta, t_2)$ is
$U(\theta - 1/2 + t_2, \theta + 1/2)$. The bigger $t_2$ is, the more concentrated the distribution
of $T_1$ given $\Theta$ is. Even though $T_2$ is ancillary and (by itself) tells us nothing about
$\Theta$, it tells us something about the conditional distribution of $T_1$ given $\Theta$.

A common (but not universal) suggestion, in classical inference, is to 
perform inference conditional on ancillaries. The reason for this is that 
when one performs classical inference conditional on a statistic, the statis- 
tic does not count as data in the inference; it merely counts as background 
information that we supposedly knew before we collected the data. Since 
the ancillary does not contain (in itself) any information about $\Theta$, no information
is being lost by not treating it as part of the data. This allows
the classical statistician a convenient way to condition on at least some of
the data. In the Bayesian framework, one could construct a joint distribution
for $\Theta$ and the data, then condition on all of the data, and whether
a statistic is ancillary becomes irrelevant. In fact, whether a statistic is 
sufficient becomes irrelevant. Fisher (1934) proposed inference conditional 
on ancillaries because he claimed that it made better use of the informa- 
tion available in the actual sample obtained. Example 2.52 on page 100 
illustrates Fisher's point, as does the first half of the following example. 

Example 2.42. Suppose that $X_1, X_2$ are conditionally IID with $U(\theta - 1/2, \theta + 1/2)$
distribution given $\Theta = \theta$. Let $T_1 = \max\{X_1, X_2\}$ and $T_2 = |X_1 - X_2|$. It is
traditional, in the classical literature, to interpret the statement

$$P_\theta(T_1 - T_2 \le \Theta \le T_1) = \frac{1}{2}, \quad \text{for all } \theta$$

as meaning that one is 50 percent confident that the random interval

$$[T_1 - T_2, T_1] = [\min X_i, \max X_i] \tag{2.43}$$

will contain $\Theta$. 9 However, we already saw that the conditional distribution of $T_1$
given $T_2 = t_2$ and $\Theta = \theta$ is $U(\theta - 1/2 + t_2, \theta + 1/2)$. It follows that

$$P_\theta(T_1 - T_2 \le \Theta \le T_1 | T_2 = t_2) = \begin{cases} \dfrac{t_2}{1 - t_2} & \text{if } t_2 \le 1/2, \\ 1 & \text{if } t_2 > 1/2. \end{cases}$$

If, for example, $T_2 > 1/2$ is observed, then we know that $\Theta$ is in the interval
between $\min X_i$ and $\max X_i$. It would seem that knowledge of the ancillary gives
us a better idea of how much "confidence" we should have that $\Theta$ is in the interval.

Alternatively, we could choose our interval using the conditional distribution
given the ancillary $T_2$. For example, we can easily show that

$$P_\theta\left(T_1 - \frac{1 + T_2}{4} \le \Theta \le T_1 + \frac{1}{4} - \frac{3T_2}{4} \,\Big|\, T_2\right) = \frac{1}{2}, \quad \text{for all } \theta. \tag{2.44}$$

In the classical theory, one would be 50 percent confident that the random interval
$[T_1 - (1 + T_2)/4, T_1 + 1/4 - 3T_2/4]$ covers $\Theta$ conditional on $T_2$. In fact, since
the probability in (2.44) is the same for all $T_2$ values, one would be 50 percent
confident that the random interval covers $\Theta$ marginally. If one desires an interval
in which one can place 50 percent confidence after seeing the data, then the
interval in (2.44) makes far more sense than the one in (2.43). If $T_2$ is observed to
be small, then we have not learned much about $\Theta$, and the conditional interval is
wide to reflect the uncertainty. The unconditional interval in (2.43) is very short,
however, which is counterintuitive. Similarly, when $T_2$ is observed to be large, we
have learned a lot about $\Theta$ and the second interval is short, while the first one is
wide.

Suppose that we have a prior distribution for $\Theta$ with density $f_\Theta(\theta)$. Then the
posterior density of $\Theta$ is a constant times $f_\Theta(\theta) I_{[t_1 - 1/2,\ t_1 - t_2 + 1/2]}(\theta)$. If $f_\Theta(\theta)$ is
almost constant over the interval $[t_1 - 1/2, t_1 - t_2 + 1/2]$, then the posterior is
approximately

$$f_{\Theta|X}(\theta|x) \approx \frac{1}{1 - t_2} I_{[t_1 - 1/2,\ t_1 - t_2 + 1/2]}(\theta).$$

The posterior probability that $\Theta$ is in the interval in (2.44) is nearly 1/2. If
one uses the improper prior with constant density, then the posterior for $\Theta$ is
$U(t_1 - 1/2, t_1 - t_2 + 1/2)$, and the fact that the posterior probability is 1/2 that
$\Theta$ is in the interval (2.44) will turn out to be a special case of Theorem 6.78.
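
The two intervals in Example 2.42 can be compared with exact arithmetic. The sketch below relies only on the fact, stated in the example, that $T_1$ given $T_2 = t_2$ and $\Theta = \theta$ is uniform on $[\theta - 1/2 + t_2, \theta + 1/2]$, so each conditional coverage probability is a ratio of interval lengths. Setting $\theta = 0$ loses no generality because everything shifts with $\theta$. The function names and the chosen values of $t_2$ are illustrative.

```python
from fractions import Fraction

def coverage(t2, below, above):
    """P(T1 - below <= theta <= T1 + above | T2 = t2), taking theta = 0.

    Given T2 = t2, T1 is uniform on [t2 - 1/2, 1/2], so the probability is
    the length of its overlap with [-above, below] divided by 1 - t2.
    """
    lo, hi = t2 - Fraction(1, 2), Fraction(1, 2)
    overlap = min(hi, below) - max(lo, -above)
    return max(overlap, Fraction(0)) / (1 - t2)

def naive(t2):
    """Conditional coverage of [T1 - T2, T1], the interval in (2.43)."""
    return coverage(t2, below=t2, above=Fraction(0))

def conditional(t2):
    """Conditional coverage of [T1 - (1 + T2)/4, T1 + 1/4 - 3*T2/4]."""
    return coverage(t2, below=(1 + t2) / 4,
                    above=Fraction(1, 4) - 3 * t2 / 4)

for t2 in (Fraction(1, 10), Fraction(1, 4), Fraction(3, 4)):
    print(t2, naive(t2), conditional(t2))
```

The first interval's conditional coverage ranges from near 0 to 1 as $t_2$ grows, while the second stays at 1/2 for every $t_2$, which is the contrast the example draws.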

Sometimes there is more than one ancillary statistic available. Some prin- 
ciple is needed to choose between them. 

Definition 2.45. An ancillary U is maximal if every other ancillary is a 
function of U. 

Example 2.46. Let $P_\theta$ say that $\{Y_n\}_{n=1}^\infty$ are IID with density (with respect to
counting measure on the set {(0,0),(0,1),(1,0),(1,1)})

$$f_{Y_1|\Theta}(y|\theta) = \begin{cases} \frac{1}{6}(1 - \theta) & \text{if } y = (0, 0), \\ \frac{1}{6}(1 + \theta) & \text{if } y = (0, 1), \\ \frac{1}{6}(2 + \theta) & \text{if } y = (1, 0), \\ \frac{1}{6}(2 - \theta) & \text{if } y = (1, 1). \end{cases}$$

9 This is an example of a confidence interval statement. We will discuss confidence
intervals in more depth in Section 5.2.1.






Here, $\Omega = [0, 1]$. Now, let $X = (Y_1, \ldots, Y_N)$. Let the observable counts be $N_{ij}$
equal to the number of Ys with the first coordinate i and the second coordinate
j. Let $M_i$ be the number of vectors with the first coordinate i, and let $N_j$ be the
number with the second coordinate j.

                        First Coordinate
                        0           1
    Second        0     $N_{00}$    $N_{10}$    $N_0$
    Coordinate    1     $N_{01}$    $N_{11}$    $N_1$
                        $M_0$       $M_1$       $N$

Then $N = N_{00} + N_{10} + N_{01} + N_{11}$ and

$$f_{X|\Theta}(x|\theta) = \frac{(1 - \theta)^{N_{00}} (1 + \theta)^{N_{01}} (2 + \theta)^{N_{10}} (2 - \theta)^{N_{11}}}{6^N}.$$

Any three of the $N_{ij}$ form a minimal sufficient statistic. We also see that

$$M_0 \sim Bin\left(N, \tfrac{1}{3}\right), \quad \text{given } \Theta = \theta,$$
$$N_0 \sim Bin\left(N, \tfrac{1}{2}\right), \quad \text{given } \Theta = \theta.$$

Both $M_0$ and $N_0$ are ancillary, but neither is maximal. The conditional inference
will depend on which ancillary one chooses.

For example, $E_\theta(1 - 3N_{00}/N_0 | N_0) = \theta$, and $E_\theta(1 - 2N_{00}/M_0 | M_0) = \theta$. If
one wanted to estimate $\Theta$ in the classical framework, there would seem to be
two natural estimators available depending on which ancillary one chooses. (See
Problems 31 and 32 on page 142.)
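
The arithmetic behind Example 2.46 is easy to verify exactly. The following sketch takes the four cell probabilities $(1-\theta)/6$, $(1+\theta)/6$, $(2+\theta)/6$, $(2-\theta)/6$ for the cells (0,0), (0,1), (1,0), (1,1), and checks that both margins are free of $\theta$ and that both conditional estimators have mean $\theta$. It uses the fact that, given a margin count, $N_{00}$ is binomial with the corresponding conditional cell probability. The function names are hypothetical.

```python
from fractions import Fraction

def cell_probs(theta):
    """Cell probabilities keyed by (first coordinate, second coordinate)."""
    return {(0, 0): (1 - theta) / 6, (0, 1): (1 + theta) / 6,
            (1, 0): (2 + theta) / 6, (1, 1): (2 - theta) / 6}

def summary(theta):
    p = cell_probs(theta)
    p_first0 = p[(0, 0)] + p[(0, 1)]     # success probability for M_0
    p_second0 = p[(0, 0)] + p[(1, 0)]    # success probability for N_0
    # E(N_00 | N_0)/N_0 and E(N_00 | M_0)/M_0 are conditional cell probabilities.
    mean_given_n0 = 1 - 3 * p[(0, 0)] / p_second0
    mean_given_m0 = 1 - 2 * p[(0, 0)] / p_first0
    return p_first0, p_second0, mean_given_n0, mean_given_m0

for theta in (Fraction(0), Fraction(1, 3), Fraction(1)):
    print(theta, summary(theta))
```

Both estimators are conditionally unbiased, yet they condition on different ancillaries and generally give different values for the same data, which is the difficulty the example points to.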

Sometimes we need to condition on a statistic even if it is not ancillary. 
The following example was given by Morris DeGroot (personal communi- 
cation). A similar example can be found in Pratt (1962). 

Example 2.47. Consider a meter that is trying to measure a quantity $\Theta$. Suppose
that the meter gives a reading Z, which has $N(\theta, 1)$ distribution given $\Theta = \theta$
if Z < 2, but if Z > 2, the reading is always 2. Let $X = \min\{Z, 2\}$ be the reading.
Then $P_\theta(X = 2) = 1 - \Phi(2 - \theta)$, where $\Phi$ is the standard normal distribution
function. For x < 2, $f_{X|\Theta}(x|\theta)$ is the $N(\theta, 1)$ density. The event {X = 2} is not
ancillary but is obviously important for what inference to perform. For example,
trying to construct an unbiased estimator is difficult since

$$E_\theta(X) = \Phi(2 - \theta)\theta + 2[1 - \Phi(2 - \theta)] - \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(2 - \theta)^2}{2}\right).$$

On the other hand, if X < 2 is observed, the inference should be the same as 
if we had merely observed Z, since we actually did observe Z and the fact that 
Z > 2 could have occurred but didn't is irrelevant. If X = 2 is observed, the 
inference should be based on the fact that all we know is Z > 2, since the fact 
that X < 2 had been possible is now irrelevant. 
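The expression for $E_\theta(X)$ in the censored-meter example can be checked numerically. The following is a minimal sketch in Python, assuming a cap of 2 and unit variance as in the example; the function names (`mean_closed_form`, `mean_numeric`) are illustrative and not part of the text. It compares the closed form with a Simpson's-rule integration of $z$ against the $N(\theta, 1)$ density below the cap, plus the point mass at the cap.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mean_closed_form(theta, cap=2.0):
    """E(min{Z, cap}) for Z ~ N(theta, 1), using the closed form."""
    a = cap - theta
    return theta * Phi(a) + cap * (1.0 - Phi(a)) - phi(a)

def mean_numeric(theta, cap=2.0, lower=-12.0, steps=20000):
    """Same quantity by Simpson's rule on (theta + lower, cap) plus the atom at cap.

    Assumes theta + lower < cap and that `steps` is even.
    """
    lo = theta + lower
    h = (cap - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * z * phi(z - theta)
    return total * h / 3.0 + cap * (1.0 - Phi(cap - theta))

for theta in (0.0, 1.0, 2.5):
    print(theta, mean_closed_form(theta), mean_numeric(theta))
```

Because $X \le Z$, the mean of $X$ is always below $\theta$, which is one way to see why $X$ itself is biased for $\theta$.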

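A possible Bayesian solution to this problem would be to let $\Theta$ have a conjugate
prior distribution, say $\Theta \sim N(\theta_0, \sigma_0^2)$ for known values of $\theta_0$ and $\sigma_0$. The
conditional distribution of $\Theta$ given $X = x$ is $N(\theta_1, \sigma_1^2)$ if $x < 2$, where

$$\theta_1 = \frac{\theta_0 + x\sigma_0^2}{1 + \sigma_0^2}, \qquad \sigma_1^2 = \frac{\sigma_0^2}{1 + \sigma_0^2}.$$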



2.1. Definitions 99 



Inference would then proceed as if no truncation had been possible. On the other
hand, if $X = 2$ is observed, the conditional distribution of $\Theta$ given $X = 2$ has
density

$$f_{\Theta|X}(\theta|2) = \frac{[1 - \Phi(2-\theta)]\exp\{-(\theta-\theta_0)^2/(2\sigma_0^2)\}}{\int [1 - \Phi(2-t)]\exp\{-(t-\theta_0)^2/(2\sigma_0^2)\}\,dt}.$$

If we want the posterior mean of $\Theta$, we can integrate to get

$$E(\Theta \mid X = 2) = \theta_0 + \frac{\sigma_0^2}{\sqrt{1+\sigma_0^2}}\,\frac{\phi(a)}{1 - \Phi(a)}, \qquad a = \frac{2 - \theta_0}{\sqrt{1+\sigma_0^2}},$$

where $\phi$ is the standard normal density.

Brown (1967) and Buehler and Fedderson (1963) prove that there are
other statistics that are not ancillary but on which it might pay to condition
when making inferences. In particular, they consider the case in which
$X_1, \ldots, X_n$ are conditionally IID with $N(\mu, \sigma^2)$ distribution, conditional
on $\Theta = \theta = (\mu, \sigma)$. Let $\bar X = \sum_{i=1}^n X_i/n$ and $S^2 = \sum_{i=1}^n (X_i - \bar X)^2/(n-1)$.
It is well known that $P_\theta(|\bar X - \mu|/S > k)$ depends only on $k$ and $n$, call it
$\alpha(k, n)$. What these authors show is that there is a set $C$ and a number
$a < \alpha(k, n)$ such that $P_\theta(|\bar X - \mu|/S > k \mid (\bar X, S) \in C) < a$ for all $\theta$. Pierce
(1973), Wallace (1959), and Buehler (1959) give conditions under which
such examples can and cannot arise.

Ancillaries are only useful if there is no boundedly complete sufficient
statistic.

Theorem 2.48 (Basu's theorem). 10 If $T$ is a boundedly complete sufficient
statistic and $U$ is ancillary, then $U$ and $T$ are independent given
$\Theta = \theta$, and they are marginally independent no matter what prior one
uses.

10 See Basu (1955, 1958).

Proof. Let $A$ be some measurable set of possible values of $U$. Since $U$ is
ancillary, $P'_\theta(U \in A) = \Pr(U \in A)$ for all $\theta$. But,
$P'_\theta(U \in A) = \int P'_\theta(U \in A \mid T = t)\,dP_{\theta,T}(t)$. So

$$\int [\Pr(U \in A) - \Pr(U \in A \mid T = t)]\,dP_{\theta,T}(t) = 0, \qquad (2.49)$$

for all $\theta$, since $T$ is sufficient. Let $g(t) = \Pr(U \in A) - \Pr(U \in A \mid T = t)$,
which is a bounded measurable function. Equation (2.49) says that
$E_\theta(g(T)) = 0$ for all $\theta$. Since $T$ is boundedly complete, we have
$P'_\theta(g(T) = 0) = 1$ for all $\theta$. This means that
$P'_\theta(U \in A) = P'_\theta(U \in A \mid T = t)$, a.s. $[P_{\theta,T}]$ for all $\theta$, which implies
that $U$ and $T$ are conditionally independent given $\Theta = \theta$, for all $\theta$.



100 Chapter 2. Sufficient Statistics 



Let $\mu_\Theta$ be an arbitrary prior, and let $B$ be a measurable set of possible
values of $T$.

$$\Pr(U \in A, T \in B) = \int \int_B \Pr(U \in A \mid T = t)\,dP_{\theta,T}(t)\,d\mu_\Theta(\theta)$$
$$= \int \Pr(U \in A)\,P_{\theta,T}(B)\,d\mu_\Theta(\theta) = \Pr(U \in A)\Pr(T \in B),$$

which says that $U$ and $T$ are marginally independent. □

Basu's theorem 2.48 says that if $T$ is a boundedly complete sufficient
statistic, conditioning on an ancillary is not going to change the joint
distribution of $T$ and $\Theta$. Both Bayesian and classical statisticians would ignore
the ancillary in such a case.

Example 2.50. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $N(\theta, 1)$. Let
$X = (X_1, \ldots, X_n)$, $T = \bar X$, and $U = \sum_{i=1}^n (X_i - T)^2/(n-1)$. Then $T$ is a complete
sufficient statistic and $U$ is ancillary. They are independent given $\Theta = \theta$ and are
marginally independent no matter what prior we use.
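The conclusion of Example 2.50 can be illustrated by simulation. The following Python sketch is an illustration only, with hypothetical function names and arbitrary choices of seed, sample size, and replication count. It draws repeated samples from $N(\theta, 1)$, computes $T$ (the sample mean) and $U$ (the sample variance), and estimates their correlation. A near-zero correlation is only a necessary consequence of independence, not a proof of it; the sketch also shows that the average of $U$ does not move with $\theta$, as ancillarity requires.

```python
import math
import random
import statistics

def simulate(theta, n, reps, seed=0):
    """Return lists of sample means T and sample variances U over `reps` samples."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        means.append(statistics.fmean(x))
        variances.append(statistics.variance(x))  # divides by n - 1
    return means, variances

def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

t0, u0 = simulate(theta=0.0, n=5, reps=20000, seed=1)
t3, u3 = simulate(theta=3.0, n=5, reps=20000, seed=2)
print("corr(T, U) at theta=0:", correlation(t0, u0))
print("mean of U at theta=0 and theta=3:", statistics.fmean(u0), statistics.fmean(u3))
```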

Example 2.51. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu, \sigma^2)$, where
$\theta = (\mu, \sigma)$. Let $X = (X_1, \ldots, X_n)$, $T_1 = \bar X$, and $T_2 = \sqrt{\sum_{i=1}^n (X_i - \bar X)^2/(n-1)}$.
The fact that $T = (T_1, T_2)$ is a complete sufficient statistic will follow most easily
from Theorem 2.74, to be proven later. Let

$$U = \left(\frac{X_1 - T_1}{T_2}, \ldots, \frac{X_n - T_1}{T_2}\right).$$

Then, $U$ is ancillary and independent of $T$ given $\Theta = \theta$. Also $T$ and $U$ are
marginally independent no matter what prior we use. Since the coordinates of $U$ sum
to 0 and their squares sum to $n - 1$, the distribution of $U$ is uniform on a sphere of
radius $\sqrt{n-1}$ in an $(n-1)$-dimensional hyperplane. (See Problem 28 on page 141.)

One reason that some people give for conditioning on an ancillary is 
that they get a better measure of the precision of the inference. Here is an 
example due to Basu. 

Example 2.52. Let $\Theta = (\Theta_1, \ldots, \Theta_N)$, where $N$ is the (known) size of a
population and $\Theta_i$ is some characteristic of unit $i$ in the population. Select a set of
labels $i_1, \ldots, i_n$ from $\{1, \ldots, N\}$ with replacement, with $n < N$. Let $X_j = \Theta_{i_j}$ be
observed for $j = 1, \ldots, n$. Let $X = (X_1, \ldots, X_n)$. If the selection is random, then
$f_{X|\Theta}(x|\theta) = 1/N^n$ for all $x$ compatible with $\theta$. (Notice that the distribution of $X$
is dependent on $\Theta$ even though the distribution of the labels is not.) Let $M$ be the
number of distinct labels drawn. Then $P_\theta(M = m)$ is the same for all $\theta$, so $M$ is
ancillary. Let $X_1^*, \ldots, X_M^*$ be the distinct observed values. One possible estimate
of the population average is $\bar X^* = \sum_{i=1}^M X_i^*/M$. The conditional variance of $\bar X^*$
given $\Theta = \theta$ and $M = m$ is

$$\mathrm{Var}(\bar X^* \mid \Theta = \theta, M = m) = \frac{N-m}{N-1}\,\frac{\sigma^2}{m},$$

where $\sigma^2 = \sum_{i=1}^N (\theta_i - \bar\theta)^2/N$.





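To see that this is a better measure of the variance of $\bar X^*$ than is the marginal
variance, consider the simple case $n = 3$. The distribution of $M$ is

$$f_M(m) = \begin{cases} \dfrac{1}{N^2} & \text{if } m = 1, \\[1ex] \dfrac{3(N-1)}{N^2} & \text{if } m = 2, \\[1ex] \dfrac{(N-1)(N-2)}{N^2} & \text{if } m = 3, \\[1ex] 0 & \text{otherwise.} \end{cases}$$

Since $E(\bar X^* \mid \Theta = \theta, M = m) = \bar\theta$ for all $m$, it follows that

$$\mathrm{Var}(\bar X^* \mid \Theta = \theta) = E\left(\frac{N-M}{N-1}\,\frac{\sigma^2}{M} \,\Big|\, \Theta = \theta\right) = \frac{\sigma^2}{N^2}\left[1 + \frac{3(N-2)}{2} + \frac{(N-2)(N-3)}{3}\right] = \frac{(2N-1)\sigma^2}{6N}.$$

If $M = 3$, the marginal variance is larger than the conditional variance, while if
$M = 1$, the marginal variance is too small.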
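The distribution of $M$ and the resulting marginal variance can be checked by brute-force enumeration of all $N^n$ label sequences. The sketch below uses exact rational arithmetic; the function names are illustrative, and the closed form $(2N-1)/(6N)$ that it is compared against is the value obtained by carrying out the expectation for $n = 3$ with the conditional variance $\frac{N-m}{N-1}\frac{\sigma^2}{m}$.

```python
from fractions import Fraction
from itertools import product

def distribution_of_M(N, n):
    """Exact distribution of the number of distinct labels in n draws with replacement."""
    counts = {}
    for labels in product(range(N), repeat=n):
        m = len(set(labels))
        counts[m] = counts.get(m, 0) + 1
    total = N ** n
    return {m: Fraction(c, total) for m, c in sorted(counts.items())}

def conditional_factor(N, m):
    """Multiplier of sigma^2 in Var(Xbar* | M = m)."""
    return Fraction(N - m, (N - 1) * m)

def marginal_factor(N, n):
    """Multiplier of sigma^2 in the marginal variance: E[(N - M) / ((N - 1) M)]."""
    return sum(p * conditional_factor(N, m) for m, p in distribution_of_M(N, n).items())

N = 10
print(distribution_of_M(N, 3))
print(marginal_factor(N, 3), Fraction(2 * N - 1, 6 * N))
print(conditional_factor(N, 3), conditional_factor(N, 1))
```

For every $N > 3$ the conditional factor at $m = 3$ lies below the marginal factor and the one at $m = 1$ lies above it, which is the comparison made in the text.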

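To execute a Bayesian solution, we need a distribution for $\Theta$. Suppose that
we model the $\Theta_i$ as exchangeable random variables with $\Theta_i$ conditionally
independent $N(\psi, \sigma^2)$ given $(\Psi, \Sigma) = (\psi, \sigma)$. The distribution of $\Psi$ given $\Sigma = \sigma$ is
$N(\psi_0, \sigma^2/\lambda_0)$, while the distribution of $\Sigma^2$ is $\Gamma^{-1}(a_0/2, b_0/2)$. 11 The data consist
of observing $M = m$ and $\Theta_{i_j} = x_j^*$, for $j = 1, \ldots, m$. The unobserved $\Theta_i$ are still
exchangeable, and their conditional distribution given $(\Psi, \Sigma) = (\psi, \sigma)$ is as in the
prior. The distribution of $\Psi$ given $\Sigma = \sigma$ and $X = x$ is $\Psi \sim N(\psi_1, \sigma^2/\lambda_1)$, and
the distribution of $\Sigma^2$ given $X = x$ is $\Gamma^{-1}(a_1/2, b_1/2)$, where

$$\lambda_1 = \lambda_0 + m, \qquad \psi_1 = \frac{\lambda_0\psi_0 + m\bar x^*}{\lambda_1},$$

$$a_1 = a_0 + m, \qquad b_1 = b_0 + \sum_{i=1}^m (x_i^* - \bar x^*)^2 + \frac{m\lambda_0}{m + \lambda_0}(\bar x^* - \psi_0)^2.$$

The posterior distribution of the population average $\bar\Theta$ is obtained in stages.
First, conditional on $(\Psi, \Sigma) = (\psi, \sigma)$,

$$\bar\Theta \sim N\left(\frac{m\bar x^* + (N-m)\psi}{N},\ \sigma^2\,\frac{N-m}{N^2}\right).$$

Integrating $\psi$ out of this, we get that conditional on $\Sigma = \sigma$,

$$\bar\Theta \sim N\left(\frac{m\bar x^* + (N-m)\psi_1}{N},\ \frac{\sigma^2}{N^2}\left[N - m + \frac{(N-m)^2}{\lambda_1}\right]\right).$$

Finally, integrating $\sigma$ out, we get that $\bar\Theta$ has distribution

$$t_{a_1}\left(\frac{m\bar x^* + (N-m)\psi_1}{N},\ \frac{b_1}{N^2 a_1}\left[N - m + \frac{(N-m)^2}{\lambda_1}\right]\right).$$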

11 This is an example of a hierarchical model, which will be discussed in more 
detail in Chapter 8. 






If we use an improper prior ($\lambda_0 = 0$, $b_0 = 0$, $a_0 = -1$), then the location becomes
$\bar x^*$ and the squared scale factor becomes

$$\frac{N-m}{Nm}\,\frac{1}{m-1}\sum_{i=1}^m (x_i^* - \bar x^*)^2.$$

The latter is very close to the traditional finite population sampling theory
variance estimate.
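The reduction under the improper prior can be checked numerically. The Python sketch below assumes the standard normal/inverse-gamma updating formulas $\lambda_1 = \lambda_0 + m$, $\psi_1 = (\lambda_0\psi_0 + m\bar x^*)/\lambda_1$, $a_1 = a_0 + m$, $b_1 = b_0 + \sum (x_i^* - \bar x^*)^2 + \frac{m\lambda_0}{m+\lambda_0}(\bar x^* - \psi_0)^2$, and a squared scale of $\frac{b_1}{N^2 a_1}[N - m + (N-m)^2/\lambda_1]$ for the population average; the function name and the data values are hypothetical.

```python
def posterior_for_population_mean(xs, N, lam0, psi0, a0, b0):
    """Location, squared scale, and degrees of freedom of the t posterior
    for the population average, given the m distinct observed values xs."""
    m = len(xs)
    xbar = sum(xs) / m
    ss = sum((x - xbar) ** 2 for x in xs)
    lam1 = lam0 + m
    psi1 = (lam0 * psi0 + m * xbar) / lam1
    a1 = a0 + m
    b1 = b0 + ss + (m * lam0 / (m + lam0)) * (xbar - psi0) ** 2
    location = (m * xbar + (N - m) * psi1) / N
    squared_scale = (b1 / (N ** 2 * a1)) * (N - m + (N - m) ** 2 / lam1)
    return location, squared_scale, a1

# Hypothetical data: m = 4 distinct values observed from a population of N = 20.
xs = [1.0, 2.0, 4.0, 7.0]
N = 20
loc, sq_scale, df = posterior_for_population_mean(xs, N, lam0=0.0, psi0=0.0, a0=-1.0, b0=0.0)

m = len(xs)
xbar = sum(xs) / m
s2 = sum((x - xbar) ** 2 for x in xs) / (m - 1)
classical = (N - m) / (N * m) * s2  # the improper-prior squared scale
print(loc, sq_scale, df, classical)
```

With $\lambda_0 = 0$ the prior mean $\psi_0$ drops out, so any value can be passed for it.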

We conclude this section with two examples that are similar on their 
surface, but in one example the ancillary is part of the sufficient statistic 
and in the other it is not. 

Example 2.53. Let $Z \sim Ber(1/2)$ (independent of $\Theta$), and let $Y$ and $W$ be
conditionally independent given $\Theta = \theta$ (and independent of $Z$) with $Y \sim N(\theta, 1)$
and $W \sim N(\theta, 2)$. If $Z = 0$, we will observe $X = (Y, Z)$. If $Z = 1$, we will observe
$X = (W, Z)$. Let $X_1$ stand for the first coordinate of $X$. The likelihood function
is

$$f_{X|\Theta}(x_1, z|\theta) = \frac{1}{2}\left[\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x_1 - \theta)^2\right)\right]^{1-z}\left[\frac{1}{\sqrt{4\pi}}\exp\left(-\frac{1}{4}(x_1 - \theta)^2\right)\right]^{z}.$$

It is possible to show that $X_1$ is not a sufficient statistic. (See Problem 8 on
page 138.) In this case, it makes perfect sense to perform inference conditional
on the ancillary $Z$.

Example 2.54. Let $Z \sim Ber(1/2)$ (independent of $\Theta$), and let $\{Y_n\}_{n=1}^\infty$ be
conditionally IID $Ber(\theta)$ random variables given $\Theta = \theta$ (and independent
of $Z$). If $Z = 1$, we will observe $Y_1, \ldots, Y_n$ for fixed $n$. If $Z = 0$, we will observe
the $Y_i$ until we see $k$ successes with $k < n$. In Problem 9 on page 138, we will
see that a sufficient statistic is $(N, M)$, where $N$ is the number of observed $Y_i$
and $M$ is the number of successes among the observed $Y_i$. Clearly, $Z$ is ancillary,
but it is difficult to justify conditioning on $Z$ since it is not part of the sufficient
statistic. That is, if we observe $n$ of the $Y_i$ and there are $k$ successes among them,
then it does not matter to us whether $Z = 0$ or 1. (Of course, in all other cases
we can figure out what $Z$ was from the rest of the data.)



2.2 Exponential Families of Distributions 

2.2.1 Basic Properties 

There is a special class of distributions for which complete sufficient statis- 
tics with fixed dimension always exist. This class includes some, but not 
all, of the commonly used distributions. 

Definition 2.55. A parametric family with parameter space $\Omega$ and density
$f_{X|\Theta}(x|\theta)$ with respect to a measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ is called an exponential
family if

$$f_{X|\Theta}(x|\theta) = c(\theta)h(x)\exp\left\{\sum_{i=1}^k \pi_i(\theta)t_i(x)\right\},$$

for some measurable functions $\pi_1, \ldots, \pi_k$, $t_1, \ldots, t_k$ and some integer $k$.

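Example 2.56. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu, \sigma^2)$, where
$\theta = (\mu, \sigma)$. Let $X = (X_1, \ldots, X_n)$.

$$f_{X|\Theta}(x|\theta) = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\}$$
$$= (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-\frac{n\mu^2}{2\sigma^2}\right\}\exp\left\{\frac{\mu}{\sigma^2}\,n\bar x - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right\}.$$

In this form, we see that $k = 2$ and

$$c(\theta) = \sigma^{-n}\exp\left\{-\frac{n\mu^2}{2\sigma^2}\right\}, \quad h(x) = (2\pi)^{-n/2}, \quad \pi_1(\theta) = \frac{\mu}{\sigma^2}, \quad \pi_2(\theta) = -\frac{1}{2\sigma^2},$$

$$t_1(x) = n\bar x, \qquad t_2(x) = \sum_{i=1}^n x_i^2.$$

The function $c(\theta)$ in Definition 2.55 can be written as

$$c(\theta) = \left(\int_{\mathcal{X}} h(x)\exp\left\{\sum_{i=1}^k \pi_i(\theta)t_i(x)\right\}d\nu(x)\right)^{-1},$$

so that the dependence on $\theta$ is through the vector $\pi = (\pi_1(\theta), \ldots, \pi_k(\theta)) \in \mathbb{R}^k$.
We might as well let $\pi$ be the parameter.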

Definition 2.57. In an exponential family, the natural parameter is the
vector $\Pi = (\pi_1(\Theta), \ldots, \pi_k(\Theta))$, and

$$\Gamma = \left\{\pi \in \mathbb{R}^k : \int_{\mathcal{X}} h(x)\exp\left(\sum_{i=1}^k \pi_i t_i(x)\right)d\nu(x) < \infty\right\}$$

is called the natural parameter space.

The mapping $\Pi : \Omega \to \Gamma$ need not be one-to-one, nor need it be onto. It
is common, however, to use the symbol $\Theta$ for the natural parameter and
assume that $\Omega = \Gamma$. It is obvious from the form of the exponential family
density and the Fisher-Neyman factorization theorem 2.21 that $T(X) =
(t_1(X), \ldots, t_k(X))$ is a sufficient statistic. This statistic is sometimes called
the natural sufficient statistic.

The sufficient statistic from an exponential family sample also has an 
exponential family distribution. 

Lemma 2.58. 12 If $X$ has an exponential family distribution, then so does
the natural sufficient statistic $T(X)$, and the natural parameter for $T$ is the
same as for $X$. In particular, there is a measure $\nu_T$ such that $P_{\theta,T} \ll \nu_T$
for all $\theta$ and $dP_{\theta,T}/d\nu_T(t) = c(\theta)\exp(\theta^\top t)$.

12 This lemma is used in the proof of Theorem 2.62.

Proof. Apply Lemma 2.24 with $m_2(T(x), \theta) = c(\theta)\exp(\sum_{i=1}^k \theta_i t_i(x))$ and
$m_1(x) = h(x)$. □

Example 2.59 (Continuation of Example 2.56; see page 103). In the case of
$n$ conditionally IID $N(\mu, \sigma^2)$ random variables, the natural sufficient statistics
are $T_1 = \sum_{i=1}^n X_i$ and $T_2 = \sum_{i=1}^n X_i^2$. It is well known that $T_1$ and $W =
T_2 - T_1^2/n$ are independent, with $T_1$ having $N(n\mu, n\sigma^2)$ distribution and $W$ having
$\Gamma([n-1]/2, 1/[2\sigma^2])$ distribution. It follows that the joint density of $(T_1, T_2)$ is

$$f_{T_1,T_2|\Theta}(t_1, t_2|\theta) = \frac{1}{\sqrt{2\pi n}\,\sigma}\exp\left\{-\frac{(t_1 - n\mu)^2}{2n\sigma^2}\right\}\frac{(2\sigma^2)^{-(n-1)/2}}{\Gamma\left(\frac{n-1}{2}\right)}\left(t_2 - \frac{t_1^2}{n}\right)^{\frac{n-1}{2}-1}\exp\left\{-\frac{1}{2\sigma^2}\left(t_2 - \frac{t_1^2}{n}\right)\right\}.$$

This can be simplified to $c(\theta)h(t_1, t_2)\exp(\theta_1 t_1 + \theta_2 t_2)$, with $c(\theta)$ the same as in
Example 2.56 and $h(t_1, t_2)$ a constant times $(t_2 - t_1^2/n)^{(n-1)/2 - 1}$.

There are degenerate exponential families. That is, it is possible for some
linear function of $X$ to be constant (the same constant for all $\theta$) with
probability 1 given $\Theta = \theta$ for all $\theta$. For example, let $Y_1$ be $k_1$-dimensional
and have conditional density given $\Theta = \theta$ (with respect to a measure $\nu_1$)
$c(\theta)\exp(y_1^\top\theta)$. Define $\Gamma^\top = (\Theta^\top, \Psi^\top)$, where $\Psi$ is $k_2$-dimensional. Let $\nu_2$
be the measure that puts a mass of 1 on the point $r \in \mathbb{R}^{k_2}$ and puts 0
mass on the rest of $\mathbb{R}^{k_2}$. Define $\nu = \nu_1 \times \nu_2$, and let $Y^\top = (Y_1^\top, r^\top)$. Then
$Y$ has conditional density given $\Gamma = (\theta, \psi) = \gamma$ with respect to $\nu$,

$$c^*(\gamma)\exp(y^\top\gamma) = c(\theta)\exp(y_1^\top\theta),$$

where $c^*(\gamma) = c(\theta)\exp(-r^\top\psi)$. The natural parameter space of $\Gamma$ values
is $\Omega \times \mathbb{R}^{k_2}$, where $\Omega$ is the natural parameter space of $\theta$ values. For this
reason, we introduce a definition.

Definition 2.60. An exponential family of distributions for $X$ is degenerate
if there exists at least one vector $a$ and a scalar $r$ such that $P_\theta(a^\top X =
r) = 1$ for all $\theta$. If the exponential family is not degenerate, it is called
nondegenerate.

Example 2.61. Let $X \sim Mult_k(n; p_1, \ldots, p_k)$ given $P = (p_1, \ldots, p_k)$. The natural
parameter is $\Theta = (\log P_1, \ldots, \log P_k)$. We know that $P_\theta(\mathbf{1}^\top X = n) = 1$ for
all $\theta$, where $\mathbf{1}$ is a vector of $k$ 1s.

When the exponential family is degenerate, the natural parameter space
is a subset of a $(k-1)$-dimensional linear manifold in $\mathbb{R}^k$, hence it has
empty interior. Some theorems in Section 2.2.2 will require that the natural
parameter space have a nonempty interior. However, degenerate families of
distributions are easily converted into nondegenerate families by means of
linear transformations. For example, with the multinomial distribution, we
could just delete the last coordinates of both $X$ and the natural parameter.
For this nondegenerate family, the natural parameter space does contain
an open subset of $\mathbb{R}^{k-1}$.




2.2.2 Smoothness Properties 

The means of functions of exponential family random variables tend to be 
smooth functions of the natural parameter. In fact, the natural parameter 
space is itself a nice subset of Euclidean space. 

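Theorem 2.62. The natural parameter space $\Omega$ of an exponential family
is convex and $1/c(\theta)$ is a convex function.

Proof. We will work in the sufficient statistic space. Write $1/c(\theta) =
\int \exp\{t^\top\theta\}\,d\nu_T(t)$, where $\nu_T$ is the measure from Lemma 2.58. Since $\exp(\cdot)$
is a convex function, we get, for $\theta_1, \theta_2 \in \Omega$ and $0 < \alpha < 1$,

$$\frac{1}{c(\alpha\theta_1 + (1-\alpha)\theta_2)} = \int \exp\{t^\top[\alpha\theta_1 + (1-\alpha)\theta_2]\}\,d\nu_T(t)$$
$$\le \int \left(\alpha\exp\{t^\top\theta_1\} + (1-\alpha)\exp\{t^\top\theta_2\}\right)d\nu_T(t)$$
$$= \alpha\int \exp\{t^\top\theta_1\}\,d\nu_T(t) + (1-\alpha)\int \exp\{t^\top\theta_2\}\,d\nu_T(t)$$
$$= \alpha\frac{1}{c(\theta_1)} + (1-\alpha)\frac{1}{c(\theta_2)} < \infty.$$

This proves that $\alpha\theta_1 + (1-\alpha)\theta_2 \in \Omega$, so $\Omega$ is convex. It also proves that
$1/c$ is convex. □

Example 2.63. The family of exponential distributions, $Exp(\psi)$, with densities
$f_{X|\Psi}(x|\psi) = \psi\exp(-\psi x)$ for $x > 0$ has $h(x) = I_{(0,\infty)}(x)$ and natural parameter
$\theta = -\psi$. So $c(\theta) = -\theta$ and $1/c(\theta) = -1/\theta$ is convex. The natural parameter
space is $(-\infty, 0)$, a convex set.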

The following theorem is used in several places in the remainder of the 
text to establish smoothness properties of various conditional means, given 
the natural parameters, of functions of random variables with exponential 
family distributions. 

Theorem 2.64. Let the density of $T(X)$ with respect to a measure $\nu_T$ be
$c(\theta)\exp\{t^\top\theta\}$. If $\phi : \mathcal{T} \to \mathbb{R}$ is measurable and $\int |\phi(t)|\exp\{t^\top\theta\}\,d\nu_T(t) <
\infty$, for $\theta$ in the interior of the natural parameter space, then

$$f(z) = \int \phi(t)\exp\{t^\top z\}\,d\nu_T(t)$$

is an analytic function 13 of $z$ in the region where the real part of $z$ is interior
to the natural parameter space, and

$$\frac{\partial}{\partial z_i}f(z) = \int t_i\,\phi(t)\exp\{t^\top z\}\,d\nu_T(t).$$

13 By analytic function, we mean a complex-valued function of a complex (vector)
argument that is differentiable with respect to that complex argument.






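Proof. We will do the $k = 1$ case, as the others follow by induction. Let
$z_0 = a + ib$ and $\delta = \delta_1 + i\delta_2$ for some $a$ in the interior of $\Omega$. Then

$$\frac{f(z_0 + \delta) - f(z_0)}{\delta} = \int \phi(t)\exp(tz_0)\,\frac{\exp(t\delta) - 1}{\delta}\,d\nu_T(t). \qquad (2.65)$$

The maximum modulus theorem C.8 says that an analytic function on a
closed bounded set achieves its maximum on the boundary of the set. For
$0 < \gamma < \epsilon$, consider the set $C(\gamma, \epsilon) = \{z : \gamma \le |z| \le \epsilon\}$. For fixed $\epsilon$ and
every $0 < \gamma < \epsilon$,

$$\max_{z \in C(\gamma,\epsilon)}\left|\frac{\exp(tz) - 1}{z}\right| \le \max\left\{\frac{\exp(|t|\epsilon) - 1}{\epsilon},\ \frac{\exp(|t|\gamma) - 1}{\gamma}\right\}.$$

Since the limit as $\gamma \to 0$ of the last term above is $|t|$ and $\exp(|t|\epsilon) - 1 \ge |t|\epsilon$,
it follows that $|(\exp(tz) - 1)/z| \le \exp(|t|\epsilon)/\epsilon$ for all $|z| \le \epsilon$. Thus, we have
that if $|\delta| \le \epsilon$, the absolute value of the integrand in (2.65) is no more than
$|\phi(t)|\exp(at)\exp(|t|\epsilon)/\epsilon$. Thus, the integral of the absolute value is at most
$\int |\phi(t)|(\exp\{t(a + \epsilon)\} + \exp\{t(a - \epsilon)\})\,d\nu_T(t)/\epsilon$. Choose $\epsilon$ small enough so
that $a \pm \epsilon$ are in the interior of $\Omega$. By the dominated convergence theorem,

$$\lim_{\delta \to 0}\frac{f(z_0 + \delta) - f(z_0)}{\delta} = \int t\,\phi(t)\exp(tz_0)\,d\nu_T(t). \qquad \Box$$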

Theorem 2.64 allows us to calculate moments of the sufficient statistics 
in exponential families by taking derivatives of the function logc(0). 

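Example 2.66. Let $\phi(t) = 1$. Then we can calculate

$$E_\theta(T_i) = \int c(\theta)\,t_i\exp(t^\top\theta)\,d\nu_T(t) = c(\theta)\frac{\partial}{\partial\theta_i}\int \exp(t^\top\theta)\,d\nu_T(t)$$
$$= c(\theta)\frac{\partial}{\partial\theta_i}\frac{1}{c(\theta)} = -\frac{1}{c(\theta)}\frac{\partial c(\theta)}{\partial\theta_i} = -\frac{\partial}{\partial\theta_i}\log c(\theta).$$

Example 2.67 (Continuation of Example 2.63; see page 105). Consider the
$Exp(\psi)$ distribution with $\theta = -\psi$. Here $\log c(\theta) = \log(-\theta)$. So the partial with
respect to $\theta$ is $1/\theta = -E_\theta(T)$.

Example 2.68 (Continuation of Example 2.56; see page 103). Consider the
case of $N(\mu, \sigma^2)$ distributions. Here, the natural parameter is $(\theta_1, \theta_2) = (\mu/\sigma^2,
-1/[2\sigma^2])$, and the natural sufficient statistic is $(T_1, T_2) = (n\bar X, \sum_{i=1}^n X_i^2)$. So,

$$\log c(\theta) = \frac{n}{2}\log(-2\theta_2) + \frac{n\theta_1^2}{4\theta_2}. \qquad (2.69)$$

The partial derivative with respect to $\theta_1$ is

$$\frac{n\theta_1}{2\theta_2} = -n\mu = -E_\theta(T_1).$$

The partial derivative with respect to $\theta_2$ is

$$\frac{n}{2\theta_2} - \frac{n\theta_1^2}{4\theta_2^2} = -n(\sigma^2 + \mu^2) = -E_\theta(T_2).$$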

The method illustrated in Example 2.66 is actually quite general. 

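Proposition 2.70. 14 Let $T = (T_1, \ldots, T_k)$. Suppose that the conditional
density of $T$ given $\Theta = \theta$ is $f_{T|\Theta}(t|\theta) = c(\theta)\exp(\theta^\top t)$. Let $\ell_1, \ldots, \ell_k \ge 0$
be such that $\ell = \ell_1 + \cdots + \ell_k$. Then

$$E_\theta\left(\prod_{i=1}^k T_i^{\ell_i}\right) = c(\theta)\frac{\partial^\ell}{\partial\theta_1^{\ell_1}\cdots\partial\theta_k^{\ell_k}}\frac{1}{c(\theta)}.$$

In particular, $E_\theta(T_i) = -\partial/\partial\theta_i \log c(\theta)$, and

$$\mathrm{Cov}_\theta(T_i, T_j) = -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log c(\theta).$$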

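Example 2.71 (Continuation of Example 2.68; see page 106). For the $N(\mu, \sigma^2)$
case, $\log c(\theta)$ is given in (2.69). The covariance of $n\bar X$ and $\sum_{i=1}^n X_i^2$ is

$$-\frac{\partial^2}{\partial\theta_1\,\partial\theta_2}\log c(\theta) = \frac{n\theta_1}{2\theta_2^2} = 2n\mu\sigma^2,$$

as can be verified directly.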
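The moment identities for the normal family can be checked by differentiating $\log c(\theta)$ numerically. The sketch below is illustrative only: it assumes the form $\log c(\theta) = \frac{n}{2}\log(-2\theta_2) + n\theta_1^2/(4\theta_2)$ from (2.69), uses central finite differences with arbitrarily chosen step sizes, and compares the results with $E(T_1) = n\mu$, $E(T_2) = n(\mu^2 + \sigma^2)$, and $\mathrm{Cov}(T_1, T_2) = 2n\mu\sigma^2$. The function names are hypothetical.

```python
import math

def log_c(t1, t2, n):
    """log c(theta) for n IID N(mu, sigma^2) observations, natural parameterization."""
    return 0.5 * n * math.log(-2.0 * t2) + n * t1 * t1 / (4.0 * t2)

def d_dt1(f, t1, t2, h=1e-5):
    return (f(t1 + h, t2) - f(t1 - h, t2)) / (2.0 * h)

def d_dt2(f, t1, t2, h=1e-5):
    return (f(t1, t2 + h) - f(t1, t2 - h)) / (2.0 * h)

def d2_dt1dt2(f, t1, t2, h=1e-4):
    return (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
            - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4.0 * h * h)

mu, sigma, n = 1.5, 2.0, 5
theta1, theta2 = mu / sigma**2, -1.0 / (2.0 * sigma**2)
f = lambda a, b: log_c(a, b, n)

mean_T1 = -d_dt1(f, theta1, theta2)     # compare with n * mu
mean_T2 = -d_dt2(f, theta1, theta2)     # compare with n * (mu^2 + sigma^2)
cov_T1_T2 = -d2_dt1dt2(f, theta1, theta2)  # compare with 2 * n * mu * sigma^2
print(mean_T1, mean_T2, cov_T1_T2)
```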



A similar result holds for the posterior means of polynomial functions of
$\Theta$, if we use a conjugate prior.

Proposition 2.72. Let $X = (X_1, \ldots, X_n)$ where the $X_i$ are conditionally
IID, given $\Theta = \theta$, with density equal to $c(\theta)\exp(\theta^\top T(x))$, where $\theta$ is a
$k$-dimensional parameter. Suppose that the prior for $\Theta$ is proportional to
$c(\theta)^a\exp(\theta^\top b)$, where $a > 0$ and $b$ is a $k$-dimensional vector (a natural
conjugate prior). Suppose that $\ell_1, \ldots, \ell_k \ge 0$ and $\ell = \ell_1 + \cdots + \ell_k$. Write the
predictive density of $X$ as $f_X(x) = g(t_1, \ldots, t_k)$, where $t_i = \sum_{j=1}^n T_i(x_j)$. Then

$$E\left(\prod_{i=1}^k \Theta_i^{\ell_i} \,\Big|\, X = x\right) = \frac{1}{f_X(x)}\frac{\partial^\ell}{\partial t_1^{\ell_1}\cdots\partial t_k^{\ell_k}}g(t_1, \ldots, t_k).$$

Example 2.73. Suppose that $X_1, \ldots, X_n$ are conditionally IID with $N(\mu, \sigma^2)$
distribution given $M = \mu$ and $\Sigma = \sigma$. Let the prior be natural conjugate as in
Example 1.24 on page 14. The marginal density of the data is given in (1.27)
in that example. Rewriting (1.27) in terms of the natural sufficient statistics
$T_1 = n\bar X$ and $T_2 = W + n\bar X^2$, we get

$$g(t_1, t_2) = \text{constant} \times \left(b_0 + t_2 - \frac{t_1^2}{n} + \frac{n\lambda_0}{\lambda_0 + n}\left[\frac{t_1}{n} - \mu_0\right]^2\right)^{-\frac{a_0 + n}{2}}.$$


14 This proposition is used in the proofs of Theorems 3.44 and 7.57. 






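The partial derivative of this with respect to $t_1$ divided by $g(t_1, t_2)$ equals

$$-\frac{a_0 + n}{2}\left(b_0 + t_2 - \frac{t_1^2}{n} + \frac{n\lambda_0}{\lambda_0 + n}\left[\frac{t_1}{n} - \mu_0\right]^2\right)^{-1}\left(\frac{2\lambda_0}{\lambda_0 + n}\left[\frac{t_1}{n} - \mu_0\right] - \frac{2t_1}{n}\right),$$

which simplifies to $\mu_1 a_1/b_1$, the posterior mean of $\Theta_1 = M/\Sigma^2$.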

Diaconis and Ylvisaker (1979) prove other interesting results about pos- 
terior means of parameters when conjugate priors are used for exponential 
families. 

The following theorem is used to show that certain estimators and hy- 
pothesis tests have classical optimality properties when the data come from 
an exponential family. 

Theorem 2.74. If the natural parameter space $\Omega$ of an exponential family
contains an open set in $\mathbb{R}^k$, then $T(X)$ is a complete sufficient statistic.

Proof. We will prove the $k = 1$ case, and the others follow by induction.
Let $T(X)$ have density $c(\theta)\exp\{t\theta\}$ with respect to a measure $\nu_T$. Let $g$
be a function such that $E_\theta(g(T)) = 0$ for all $\theta$. Then

$$\int g(t)c(\theta)\exp\{t\theta\}\,d\nu_T(t) = 0$$

for all $\theta$. This says that

$$\int g^+(t)\exp\{t\theta\}\,d\nu_T(t) = \int g^-(t)\exp\{t\theta\}\,d\nu_T(t), \qquad (2.75)$$

where $g^+$ and $g^-$ are respectively the positive and negative parts of $g$. Since
$E_\theta(g(T))$ exists for all $\theta$, both sides of (2.75) are finite for all $\theta$. Let $\theta_0$ be
interior to $\Omega$, and let the common value of both sides of (2.75) be $r$ when
$\theta = \theta_0$. Define two probability measures:

$$P(A) = \frac{1}{r}\int_A g^+(t)\exp\{t\theta_0\}\,d\nu_T(t),$$

$$Q(A) = \frac{1}{r}\int_A g^-(t)\exp\{t\theta_0\}\,d\nu_T(t).$$

Dividing by $r$, the two sides of (2.75) become

$$\int \exp\{t[\theta - \theta_0]\}\,dP(t) = \int \exp\{t[\theta - \theta_0]\}\,dQ(t).$$

By Theorem 2.64, these are analytic functions of $\psi = \theta - \theta_0$. According
to Theorem C.7, these functions equal their power series expansions in a
neighborhood, say $(-\psi_0, \psi_0)$, of $\psi = 0$. The fact that they agree for all real
values near 0 implies that they have the same derivatives at 0, hence they
have the same power series expansion around 0, hence they are also equal
at imaginary values of $\psi$ near 0, hence they are also equal at all imaginary
values because they are analytic in the region where the real part of $\psi$ is in
$(-\psi_0, \psi_0)$. For $\psi = iu$, we get that the characteristic function of $P$ equals
the characteristic function of $Q$ in a neighborhood of 0. By Corollary B.106,
it follows that $P = Q$, hence $g^+(t) = g^-(t)$ a.e. $[\nu_T]$. This ensures that
$P_\theta(g(T) = 0) = 1$ for all $\theta$. □

As examples, the sufficient statistics from normal, exponential, Poisson, 
and Bernoulli distributions are complete. 

2.2.3 A Characterization Theorem* 

The following theorem characterizes one-parameter exponential families es- 
sentially as those families of distributions with smooth densities on a com- 
mon set with one-dimensional sufficient statistics for all sample sizes. 

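Theorem 2.76. Suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta =
\theta$ each with density $f_{X_1|\Theta}(\cdot|\theta)$. Let $T$ be a one-dimensional sufficient
statistic. Write

$$\prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta) = m_1(x)\,m_2(t, \theta).$$

Define

$$K_\theta(t) = \frac{\partial}{\partial\theta}\log m_2(t, \theta).$$

Assume the following conditions:

1. The set of $y$ such that $f_{X_1|\Theta}(y|\theta) > 0$ is the same for all $\theta$.

2. $f_{X_1|\Theta}(y|\theta)$ is differentiable with respect to $\theta$ for each $y$.

3. $f_{X_1|\Theta}(y|\theta)$ is differentiable with respect to $y$ for each $\theta$.

4. There exists $\theta_0$ such that $K_{\theta_0}(t)$ has an inverse.

Then, $X$ has an exponential family distribution with a one-dimensional
natural parameter.

Proof. Write

$$\sum_{i=1}^n \log f_{X_1|\Theta}(x_i|\theta) = \log m_2(t, \theta) + \log m_1(x),$$

and define

$$q_\theta(r) = K_\theta\left[K_{\theta_0}^{-1}(r)\right], \qquad r(x) = \sum_{i=1}^n v(x_i), \qquad v(y) = \frac{\partial}{\partial\theta}\log f_{X_1|\Theta}(y|\theta)\Big|_{\theta = \theta_0}.$$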

*This section may be skipped without interrupting the flow of ideas.






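Since $K_\theta(t) = q_\theta(K_{\theta_0}(t)) = q_\theta(r(x))$, it follows that

$$\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f_{X_1|\Theta}(x_i|\theta) = q_\theta(r(x)).$$

Thus, we get $\partial r(x)/\partial x_i = v'(x_i)$, and

$$\frac{1}{v'(x_i)}\frac{\partial^2}{\partial x_i\,\partial\theta}\log f_{X_1|\Theta}(x_i|\theta) = \frac{\partial}{\partial r}q_\theta(r). \qquad (2.77)$$

Since $r$ is invariant under permutations of the coordinates of $x$ and the
left-hand side of (2.77) depends on $x$ through $x_i$ alone, both sides must
depend only on $\theta$. So, we get

$$\frac{\partial}{\partial r}q_\theta(r) = c_1(\theta), \qquad q_\theta(r) = r\,c_1(\theta) + c_2(\theta),$$

$$K_\theta(t) = K_{\theta_0}(t)\,c_1(\theta) + c_2(\theta) = \frac{\partial}{\partial\theta}\log m_2(t, \theta),$$

$$\log m_2(t, \theta) = K_{\theta_0}(t)\,\beta_1(\theta) + \beta_2(\theta) + s(t),$$

where $\beta_i(\theta) = \int^\theta c_i(u)\,du$, for $i = 1, 2$, and $s(t)$ is determined by boundary
conditions. It follows that

$$m_2(t, \theta) = \exp\{s(t)\}\exp\{\beta_2(\theta)\}\exp\{K_{\theta_0}(t)\,\beta_1(\theta)\}.$$

Thus, we see that the density is in the form of an exponential family with
$k = 1$ and

$$h(x) = m_1(x)\exp\{s(t)\}, \qquad c(\theta) = \exp\{\beta_2(\theta)\},$$

$$t_1(x) = K_{\theta_0}(t), \qquad \pi_1(\theta) = \beta_1(\theta). \qquad \Box$$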

There are similar theorems in multiparameter cases, but they have even 
more conditions. We give a different type of theorem characterizing expo- 
nential families by their sufficient statistics in Theorem 2.114. The impor- 
tance of the existence of a fixed-dimensional sufficient statistic is twofold. 
First, it means that there is a fixed amount of information that must be 
stored for making inference about 9 regardless of the sample size. Second, 
there is the possibility of using natural conjugate prior distributions as in 
Theorem 2.25. 



2.3 Information 

It seems intuitively sensible to expect more data to provide more informa- 
tion about a parameter or a distribution. Similarly, if a statistic is suffi- 
cient, it should contain all of the information about the parameter, and 
vice versa. To make these ideas precise, we need to define information. 
There are two popular definitions of information: Fisher information and 
Kullback-Leibler information. 






2.3.1 Fisher Information 

Fisher information is designed to provide a measure of how much informa- 
tion a data set provides about a parameter in a parametric family with 
some smoothness properties. 

Definition 2.78. Suppose that $\Theta$ is $k$-dimensional and that $f_{X|\Theta}(x|\theta)$ is
the density of $X$ with respect to $\nu$. The following conditions will be known
as the FI regularity conditions:

1. There exists $B$ with $\nu(B) = 0$ such that for all $\theta$, $\partial f_{X|\Theta}(x|\theta)/\partial\theta_i$
exists for $x \notin B$ and each $i$.

2. $\int f_{X|\Theta}(x|\theta)\,d\nu(x)$ can be differentiated under the integral sign with
respect to each coordinate of $\theta$.

3. The set $C = \{x : f_{X|\Theta}(x|\theta) > 0\}$ is the same for all $\theta$.

Definition 2.79. Assume that the three FI regularity conditions above
hold. Then the matrix $\mathcal{I}_X(\theta) = ((\mathcal{I}_{X,i,j}(\theta)))$ with elements

$$\mathcal{I}_{X,i,j}(\theta) = \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(X|\theta),\ \frac{\partial}{\partial\theta_j}\log f_{X|\Theta}(X|\theta)\right)$$

is called the Fisher information matrix about $\Theta$ based on $X$. The random
vector with coordinates $\partial\log f_{X|\Theta}(X|\theta)/\partial\theta_i$ is called the score function.
If $T$ is a statistic, the conditional score function is the vector whose $i$th
coordinate is $\partial\log f_{X|T,\Theta}(X|t,\theta)/\partial\theta_i$. The conditional Fisher information
given $T = t$, denoted by $\mathcal{I}_{X|T}(\theta|t)$, is the conditional covariance matrix of
the conditional score function.

Here are some examples. 

Example 2.80. Let $b$ be known, and suppose that $X \sim N(\theta, b)$ given $\Theta = \theta$.
Then, the FI regularity conditions are satisfied and

$$\frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta) = \frac{x - \theta}{b},$$

and $\mathcal{I}_X(\theta) = 1/b$. Here we see that the smaller the known variance is, the more
information there is in the data about $\Theta$. This is intuitively sensible.

Example 2.81.* Suppose that $X \sim U(0, \theta)$ given $\Theta = \theta$. That is, $f_{X|\Theta}(x|\theta) =
\theta^{-1}I_{(0,\theta)}(x)$. In this case FI regularity conditions 1 and 3 fail, but we can still
calculate $\partial\log f_{X|\Theta}(x|\theta)/\partial\theta = -1/\theta$. We could then try to define the Fisher
information to be the mean of the derivative of this, namely $\mathcal{I}_X(\theta) = 1/\theta^2$. But
this function will not have the properties that Fisher information has under all
three FI regularity conditions.

*Example 2.81 should actually appear in the text after Example 2.85 on
page 113.




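Example 2.82. Suppose that $X \sim Bin(n, p)$ given $P = p$. Then

$$f_{X|P}(x|p) = \binom{n}{x}p^x(1-p)^{n-x}, \quad \text{for } x = 0, \ldots, n,$$

$$\log f_{X|P}(x|p) = \log\binom{n}{x} + x\log(p) + (n - x)\log(1 - p),$$

$$\frac{\partial}{\partial p}\log f_{X|P}(x|p) = \frac{x}{p} - \frac{n - x}{1 - p} = \frac{x}{p(1-p)} - \frac{n}{1-p},$$

$$\mathcal{I}_X(p) = \frac{n}{p(1-p)}.$$

The more extreme $P$ is, the more information an observation has about $P$.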
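The binomial Fisher information can be confirmed directly from the definition as the variance of the score. The following Python sketch, with an illustrative function name, sums over $x = 0, \ldots, n$ in exact rational arithmetic, taking the score to be $x/p - (n-x)/(1-p)$; it assumes $p$ is passed as a `Fraction` strictly between 0 and 1.

```python
from fractions import Fraction
from math import comb

def score_mean_and_variance(n, p):
    """Exact mean and variance of the Bin(n, p) score d/dp log f(X|p)."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for x in range(n + 1):
        prob = comb(n, x) * p**x * (1 - p)**(n - x)
        score = Fraction(x) / p - Fraction(n - x) / (1 - p)
        mean += prob * score
        second_moment += prob * score**2
    return mean, second_moment - mean**2

n, p = 6, Fraction(1, 4)
mean, info = score_mean_and_variance(n, p)
print(mean, info, n / (p * (1 - p)))
```

The mean of the score comes out as exactly 0, and the variance matches $n/[p(1-p)]$.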
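Example 2.83. Suppose that $P_\theta$ says $X \sim N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma)$. Then

$$f_{X|\Theta}(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\},$$

$$\frac{\partial}{\partial\mu}\log f_{X|\Theta}(x|\theta) = \frac{x - \mu}{\sigma^2},$$

$$\frac{\partial}{\partial\sigma}\log f_{X|\Theta}(x|\theta) = -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3},$$

$$\mathcal{I}_X(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix}.$$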

A useful result about the score function is the following. 

Proposition 2.84. When the FI regularity conditions hold, the mean of
the score function is 0. If in addition, $T$ is a statistic, then the conditional
mean given $T$ of the conditional score function is 0, a.s. $[P_{\theta,T}]$.

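If we can differentiate twice under the integral signs (as in exponential
families), we obtain

$$0 = \int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_{X|\Theta}(x|\theta)\,d\nu(x) = E_\theta\left[\frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_{X|\Theta}(X|\theta)}{f_{X|\Theta}(X|\theta)}\right].$$

Now, use the fact that

$$\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_{X|\Theta}(X|\theta) = \frac{\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_{X|\Theta}(X|\theta)\right)f_{X|\Theta}(X|\theta) - \left(\frac{\partial}{\partial\theta_i}f_{X|\Theta}(X|\theta)\right)\left(\frac{\partial}{\partial\theta_j}f_{X|\Theta}(X|\theta)\right)}{f_{X|\Theta}(X|\theta)^2}$$

in order to conclude

$$E_\theta\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_{X|\Theta}(X|\theta)\right] = 0 - \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(X|\theta),\ \frac{\partial}{\partial\theta_j}\log f_{X|\Theta}(X|\theta)\right) = -\mathcal{I}_{X,i,j}(\theta).$$

This gives an alternative method for calculating $\mathcal{I}_X(\theta)$ when we can
differentiate twice under the integral sign. In exponential families with the
natural parameterization, the situation is even simpler, since the second
derivative of the logarithm of the density does not depend on the data,
hence no expectation need be calculated. In this case,

$$\mathcal{I}_{X,i,j}(\theta) = -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log c(\theta).$$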



Example 2.85 (Continuation of Example 2.82; see page 112). The derivative of
the score function is

$$\frac{\partial^2}{\partial p^2}\log f_{X|P}(x|p) = -\frac{n}{p(1-p)} - \frac{x - np}{p^2(1-p)} + \frac{x - np}{p(1-p)^2}.$$

The mean of this is $-n/[p(1-p)] = -\mathcal{I}_X(p)$.
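The second-derivative route to the binomial Fisher information can be checked the same way. This short Python sketch, with an illustrative function name, takes the second derivative of the log density in the equivalent form $-x/p^2 - (n-x)/(1-p)^2$ and averages it exactly over the $Bin(n, p)$ distribution; it assumes $p$ is a `Fraction` strictly between 0 and 1.

```python
from fractions import Fraction
from math import comb

def mean_second_derivative(n, p):
    """Exact E[d^2/dp^2 log f(X|p)] for X ~ Bin(n, p)."""
    total = Fraction(0)
    for x in range(n + 1):
        prob = comb(n, x) * p**x * (1 - p)**(n - x)
        second = -Fraction(x) / p**2 - Fraction(n - x) / (1 - p)**2
        total += prob * second
    return total

n, p = 6, Fraction(1, 4)
print(mean_second_derivative(n, p), -n / (p * (1 - p)))
```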

Suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta = \theta$ with density
$f_{X_1|\Theta}(x|\theta)$. Let $X = (X_1, \ldots, X_n)$. In this case, $\log f_{X|\Theta}(X|\theta) =
\sum_{i=1}^n \log f_{X_1|\Theta}(X_i|\theta)$, so the score function is a sum of IID random vectors
conditional on $\Theta = \theta$. It follows that the covariance matrix for the sum, namely
$\mathcal{I}_X(\theta)$, is $n$ times the covariance matrix of one of them, namely $\mathcal{I}_{X_1}(\theta)$. That is,
$\mathcal{I}_X(\theta) = n\mathcal{I}_{X_1}(\theta)$, so Fisher information adds up over IID observations. In fact it
is additive over any finite collection of conditionally independent data sets
(see Problem 39 on page 143). In this sense it measures how much
information we have in a data set. Also, the more information the data provide,
the better we should be able to estimate functions of $\Theta$. Two such results,
which will be proven later, are Theorems 5.13 and 7.57.

There is another sense in which Fisher information measures the
information in a data set. Let $Y = g(X)$ be an arbitrary statistic. We will see
that $\mathcal{I}_X(\theta)$ is at least as large as $\mathcal{I}_Y(\theta)$.

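Theorem 2.86. Let $Y = g(X)$. Suppose that $\Theta$ is $k$-dimensional and $P_\theta \ll
\nu_X$ for all $\theta$. Then $\mathcal{I}_X(\theta) - \mathcal{I}_Y(\theta)$ is positive semidefinite. The matrix is all
0s if and only if $Y$ is sufficient.

Proof. Define $Q_\theta(C) = P'_\theta[(X, Y) \in C]$. By Corollary B.55, $Q_\theta \ll \nu$,
where $\nu(C) = \nu_X(\{x : (x, g(x)) \in C\})$, with Radon-Nikodym derivative

$$f_{X,Y|\Theta}(x, y|\theta) = f_{X|\Theta}(x|\theta) = f_{Y|\Theta}(y|\theta)\,f_{X|Y,\Theta}(x|y, \theta).$$

It follows that

$$\frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(x|\theta) = \frac{\partial}{\partial\theta_i}\log f_{Y|\Theta}(y|\theta) + \frac{\partial}{\partial\theta_i}\log f_{X|Y,\Theta}(x|y, \theta), \quad \text{a.s. } [Q_\theta], \qquad (2.87)$$

for all $\theta$. We will prove that the two terms on the right-hand side of (2.87)
are uncorrelated and that the last term is 0 a.s. if and only if $Y$ is sufficient.

Proposition 2.84 says that the first two expressions in (2.87) have mean 0
and that the last one has 0 conditional mean given $Y$, a.s. $[P_{\theta,Y}]$. It follows
from the law of total probability B.70 that for all $i$ and $j$,

$$E_\theta\left[\frac{\partial}{\partial\theta_i}\log f_{Y|\Theta}(Y|\theta)\,\frac{\partial}{\partial\theta_j}\log f_{X|Y,\Theta}(X|Y, \theta)\right] = 0.$$

Hence, the two terms on the right-hand side of (2.87) are uncorrelated. Since
the conditional mean of the conditional score function is 0, a.s.,
Proposition B.78 says that

$$\mathcal{I}_X(\theta) = \mathcal{I}_Y(\theta) + E_\theta\,\mathcal{I}_{X|Y}(\theta|Y).$$

It follows that $\mathcal{I}_X(\theta) - \mathcal{I}_Y(\theta)$ is positive semidefinite. The difference is all
0s if and only if, for all $i$, the conditional score function equals 0 a.s. $[Q_\theta]$.
This happens if and only if $f_{X|Y,\Theta}(x|y, \theta)$ is constant in $\theta$, which means if
and only if $Y$ is sufficient. □

One feature of Fisher information, which is worth noting, is that it
depends on which of several equivalent parameterizations one chooses.

Example 2.88. Suppose that $P_\theta$ says $X \sim N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma)$. The
Fisher information matrix was seen in Example 2.83 on page 112 to be

$$\mathcal{I}_X(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix}.$$

Now suppose that we chose the natural parameterization of the exponential
family, namely

$$\eta = g(\theta) = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right), \qquad \log f_{X|H}(x|\eta) = \log h(x) + \frac{1}{2}\log(-2\eta_2) + \frac{\eta_1^2}{4\eta_2} + \eta_1 x + \eta_2 x^2.$$

Taking the negative of the matrix of second partial derivatives, we get

$$\mathcal{I}^*_X(\eta) = \begin{pmatrix} -\dfrac{1}{2\eta_2} & \dfrac{\eta_1}{2\eta_2^2} \\[1.5ex] \dfrac{\eta_1}{2\eta_2^2} & \dfrac{1}{2\eta_2^2} - \dfrac{\eta_1^2}{2\eta_2^3} \end{pmatrix} = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 2\sigma^4 + 4\mu^2\sigma^2 \end{pmatrix}.$$

This is clearly not the same as $\mathcal{I}_X(g^{-1}(\eta))$.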

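In general, when changing parameters to $H = g(\Theta)$, we can use the chain
rule as follows. If $f(\theta)$ is a function of $k$ variables and $g$ is one-to-one, then

$$\frac{\partial}{\partial\eta_i}f(g^{-1}(\eta)) = \sum_{j=1}^k \frac{\partial}{\partial\theta_j}f(\theta)\,\frac{\partial\theta_j}{\partial\eta_i},$$

where $\theta_j = g_j^{-1}(\eta)$ is the $j$th coordinate of $g^{-1}$. In our case, we need to
consider $f(\theta) = \log f_{X|\Theta}(X|\theta)$. It follows that the Fisher information about
$H$ is the matrix

$$\mathcal{I}^*_X(\eta) = \Delta(\eta)\,\mathcal{I}_X(g^{-1}(\eta))\,\Delta^\top(\eta),$$

where $\Delta(\eta)$ is a matrix whose $(i, j)$ entry is $\partial g_j^{-1}(\eta)/\partial\eta_i$. The reader can
verify that this method also works in Example 2.88 above.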
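The verification left to the reader can be sketched numerically for the $N(\mu, \sigma^2)$ family. The Python code below is an illustration under stated assumptions: it takes $\theta = (\mu, \sigma)$ with information matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, the natural parameter $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$, and compares the chain-rule product $\Delta\,\mathcal{I}_X\,\Delta^\top$ with the covariance matrix of $(X, X^2)$, which is what the negative Hessian of $\log c(\eta)$ works out to. The Jacobian is computed by central differences with an arbitrarily chosen step, and all function names are hypothetical.

```python
import math

def g_inverse(eta1, eta2):
    """Map natural parameters back to theta = (mu, sigma); requires eta2 < 0."""
    sigma2 = -1.0 / (2.0 * eta2)
    return [eta1 * sigma2, math.sqrt(sigma2)]

def delta(eta1, eta2, h=1e-6):
    """Delta[i][j] = d theta_j / d eta_i, by central differences."""
    eta = [eta1, eta2]
    rows = []
    for i in range(2):
        up, down = list(eta), list(eta)
        up[i] += h
        down[i] -= h
        hi, lo = g_inverse(*up), g_inverse(*down)
        rows.append([(hi[j] - lo[j]) / (2.0 * h) for j in range(2)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def info_natural_via_chain_rule(mu, sigma):
    eta1, eta2 = mu / sigma**2, -1.0 / (2.0 * sigma**2)
    info_theta = [[1.0 / sigma**2, 0.0], [0.0, 2.0 / sigma**2]]
    d = delta(eta1, eta2)
    return matmul(matmul(d, info_theta), transpose(d))

def info_natural_direct(mu, sigma):
    """Covariance matrix of (X, X^2) for X ~ N(mu, sigma^2)."""
    s2 = sigma**2
    return [[s2, 2.0 * mu * s2], [2.0 * mu * s2, 2.0 * s2**2 + 4.0 * mu**2 * s2]]

print(info_natural_via_chain_rule(1.0, 2.0))
print(info_natural_direct(1.0, 2.0))
```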



2.3.2 Kullback-Leibler Information 

There is another measure of information that has similar properties to 
Fisher information. This measure of information is designed to measure 
how far apart two distributions are in the sense of likelihood. That is, if 
an observation were to come from one of the distributions, how likely is 
it that you could tell that the observation did not come from the other 
distribution? 

Definition 2.89. Let $P$ and $Q$ be probability measures on the same space. 
Let $p$ and $q$ be their densities with respect to a common measure $\nu$ on that 
space, for example, $P+Q$. The Kullback-Leibler information in $X$ is defined 
as 

$$\mathcal{I}_X(P;Q) = \int \log\frac{p(x)}{q(x)}\,p(x)\,d\nu(x).$$

In the case of parametric families, let $\theta$ and $\psi$ be two elements of $\Omega$. The 
Kullback-Leibler information is then 

$$\mathcal{I}_X(\theta;\psi) = E_\theta \log\frac{f_{X|\Theta}(X|\theta)}{f_{X|\Theta}(X|\psi)}.$$

If $T$ is a statistic, let $p_t$ and $q_t$ denote conditional densities for $P$ and $Q$ 
given $T = t$ with respect to a measure $\nu_t$. Then the conditional Kullback-Leibler 
information is 

$$\mathcal{I}_{X|T}(P;Q|t) = \int \log\frac{p_t(x)}{q_t(x)}\,p_t(x)\,d\nu_t(x).$$






In general, $\mathcal{I}_X(P;Q) \ne \mathcal{I}_X(Q;P)$, so Kullback-Leibler information is not 
a metric. The sum $\mathcal{I}_X(P;Q) + \mathcal{I}_X(Q;P)$ is sometimes called the Kullback-Leibler 
divergence [see Kullback (1959)]. Even divergence fails the triangle 
inequality in general, so it is not a metric. 

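Example 2.90. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. Then 

$$\log\frac{f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\psi)} = \frac{1}{2}\left[(x-\psi)^2 - (x-\theta)^2\right].$$

It follows that $\mathcal{I}_X(\theta;\psi) = (\psi - \theta)^2/2$. This time $\mathcal{I}_X(\theta;\psi) = \mathcal{I}_X(\psi;\theta)$. 

Example 2.91. Suppose that $X \sim Ber(\theta)$ given $\Theta = \theta$. Then 

$$\log\frac{f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\psi)} = x\log\frac{\theta}{\psi} + (1-x)\log\frac{1-\theta}{1-\psi}.$$

It follows that 

$$\mathcal{I}_X(\theta;\psi) = \theta\log\frac{\theta}{\psi} + (1-\theta)\log\frac{1-\theta}{1-\psi}.$$

This time $\mathcal{I}_X(\theta;\psi) \ne \mathcal{I}_X(\psi;\theta)$. 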
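
The symmetry in the normal case and the asymmetry in the Bernoulli case are easy to see numerically. The sketch below simply codes the two closed-form expressions from Examples 2.90 and 2.91; the function names and the parameter values 0.2 and 0.5 are arbitrary choices for illustration.

```python
import math


def kl_normal_unit_variance(theta, psi):
    """Kullback-Leibler information between N(theta, 1) and N(psi, 1)."""
    return (psi - theta) ** 2 / 2.0


def kl_bernoulli(theta, psi):
    """Kullback-Leibler information between Ber(theta) and Ber(psi), 0 < theta, psi < 1."""
    return (theta * math.log(theta / psi)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - psi)))


normal_forward = kl_normal_unit_variance(0.2, 0.5)
normal_backward = kl_normal_unit_variance(0.5, 0.2)

bernoulli_forward = kl_bernoulli(0.2, 0.5)
bernoulli_backward = kl_bernoulli(0.5, 0.2)
```

Swapping the arguments leaves the normal value unchanged but changes the Bernoulli value.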

Kullback-Leibler information measures the information in a data set in 
some of the same ways that Fisher information does. 

Proposition 2.92. The Kullback-Leibler information $\mathcal{I}_X(P;Q) \ge 0$, and 
it equals 0 if and only if $P = Q$. The conditional Kullback-Leibler information 
$\mathcal{I}_{X|T}(P;Q|t) \ge 0$, a.s. $[P_T]$, and it equals 0 a.s. $[P_T]$ if and only 
if $p_t(x) = q_t(x)$, a.s. $[P]$. (See Definition 2.89.) Also, if $X$ and $Y$ are 
conditionally independent given $\Theta$ and $\theta,\psi \in \Omega$, then 

$$\mathcal{I}_{X,Y}(\theta;\psi) = \mathcal{I}_X(\theta;\psi) + \mathcal{I}_Y(\theta;\psi).$$

Theorem 2.93. If $Y = g(X)$, then $\mathcal{I}_X(\theta;\psi) \ge \mathcal{I}_Y(\theta;\psi)$ with equality for 
all $\theta$ and $\psi$ if and only if $Y$ is sufficient. 

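Proof. Use the same setup as in Theorem 2.86. 

$$\begin{aligned}
\mathcal{I}_X(\theta;\psi) &= E_\theta \log\frac{f_{X|\Theta}(X|\theta)}{f_{X|\Theta}(X|\psi)} \\
&= E_\theta \log\frac{f_{Y|\Theta}(Y|\theta)\,f_{X|Y,\Theta}(X|Y,\theta)}{f_{Y|\Theta}(Y|\psi)\,f_{X|Y,\Theta}(X|Y,\psi)} \\
&= \mathcal{I}_Y(\theta;\psi) + E_\theta\left\{\mathcal{I}_{X|Y}(\theta;\psi|Y)\right\} \ge \mathcal{I}_Y(\theta;\psi),
\end{aligned}$$

where the last line follows from Proposition 2.92. To make the inequality 
into equality, Proposition 2.92 says that we must have 

$$f_{X|Y,\Theta}(X|Y,\theta) = f_{X|Y,\Theta}(X|Y,\psi), \quad \text{a.s. } [P_\theta].$$

But this is true for all $\theta$ and $\psi$ if and only if $Y$ is sufficient. □ 

The Kullback-Leibler information tells us how far one distribution is from 
another in terms of likelihood. 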






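Example 2.94. Let $\Omega = (0, 1)$ and suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID 
$Ber(\theta)$. Let $X = (X_1, \ldots, X_n)$. Let $\psi > \theta$, and let $\Theta$ be discrete with 

$$f_\Theta(y) = \begin{cases} \pi_0 & \text{if } y = \theta, \\ 1 - \pi_0 & \text{if } y = \psi. \end{cases}$$

Then 

$$\Pr(\Theta = \theta|X = x) = \frac{\pi_0\theta^x(1-\theta)^{n-x}}{\pi_0\theta^x(1-\theta)^{n-x} + (1-\pi_0)\psi^x(1-\psi)^{n-x}} = \left(1 + \frac{1-\pi_0}{\pi_0}\left(\frac{\psi}{\theta}\right)^x\left(\frac{1-\psi}{1-\theta}\right)^{n-x}\right)^{-1},$$

where $x = \sum_{i=1}^n x_i$. Let $p_n = x/n$. Then 

$$\Pr(\Theta = \theta|X = x) = \left(1 + \frac{1-\pi_0}{\pi_0}\exp\left\{n\left(\mathcal{I}_X(p_n;\theta) - \mathcal{I}_X(p_n;\psi)\right)\right\}\right)^{-1}.$$

So, the probability of either $\theta$ or $\psi$ increases with more data, depending on to 
which one $p_n$ is closer in Kullback-Leibler information. 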
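
The two expressions for the posterior probability in Example 2.94 can be compared numerically. The sketch below is illustrative only: it computes the posterior probability of $\theta$ once directly from Bayes' theorem and once through the Kullback-Leibler form, writing the prior odds factor $(1-\pi_0)/\pi_0$ explicitly. The helper names and the numerical values are assumptions made for the illustration.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler information between Ber(p) and Ber(q), with 0 log 0 = 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def posterior_direct(x, n, theta, psi, pi0):
    """Pr(Theta = theta | x successes in n trials) from Bayes' theorem."""
    num = pi0 * theta ** x * (1.0 - theta) ** (n - x)
    den = num + (1.0 - pi0) * psi ** x * (1.0 - psi) ** (n - x)
    return num / den


def posterior_via_kl(x, n, theta, psi, pi0):
    """The same probability written in terms of Kullback-Leibler information."""
    p_n = x / n
    gap = kl_bernoulli(p_n, theta) - kl_bernoulli(p_n, psi)
    return 1.0 / (1.0 + (1.0 - pi0) / pi0 * math.exp(n * gap))


direct = posterior_direct(7, 20, 0.3, 0.6, 0.5)
via_kl = posterior_via_kl(7, 20, 0.3, 0.6, 0.5)

# Same observed proportion p_n = 0.3 (equal to theta) with more data.
small_sample = posterior_via_kl(3, 10, 0.3, 0.6, 0.5)
large_sample = posterior_via_kl(30, 100, 0.3, 0.6, 0.5)
```

With $p_n$ closer to $\theta$ than to $\psi$, the posterior probability of $\theta$ grows as $n$ increases, as the example states.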

One advantage Kullback-Leibler information has over Fisher information 
is that it is not affected by changes in parameterization. Another advantage 
is that Kullback-Leibler information can be used even if the distributions 
under consideration are not all members of a parametric family. 

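Example 2.95. Suppose that $P$ is the standard normal $N(0, 1)$ distribution and 
$Q$ is the Laplace distribution $Lap(0, 1)$. Then 

$$p(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right), \quad E_P(X^2) = 1, \quad E_P(|X|) = \sqrt{\frac{2}{\pi}},$$

$$q(x) = \frac{1}{2}\exp(-|x|), \quad E_Q(X^2) = 2, \quad E_Q(|X|) = 1.$$

It follows that 

$$\log\frac{p(x)}{q(x)} = \frac{1}{2}\log\frac{2}{\pi} - \frac{1}{2}x^2 + |x|,$$

$$\mathcal{I}_X(P;Q) = \frac{1}{2}\log\frac{2}{\pi} - \frac{1}{2} + \sqrt{\frac{2}{\pi}} = 0.07209,$$

$$\mathcal{I}_X(Q;P) = -\frac{1}{2}\log\frac{2}{\pi} + 1 - 1 = 0.22579.$$

If data come from a Laplace distribution, it is easier to tell that they don't come 
from a normal distribution than vice versa. 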
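
The two values in Example 2.95 can be reproduced by numerical integration. The following sketch, with illustrative function names, integrates $\log(f/g)\,f$ by the midpoint rule on a wide finite interval (an approximation; the tails outside $[-40, 40]$ are negligible for both densities) and compares the results with the closed forms.

```python
import math


def log_p(x):
    """Log density of the standard normal distribution."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x


def log_q(x):
    """Log density of the Laplace(0, 1) distribution."""
    return -math.log(2.0) - abs(x)


def kl_numeric(log_f, log_g, lo=-40.0, hi=40.0, steps=200_000):
    """Midpoint-rule approximation of the integral of log(f/g) f over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        lf = log_f(x)
        total += (lf - log_g(x)) * math.exp(lf) * width
    return total


kl_pq_closed = 0.5 * math.log(2.0 / math.pi) - 0.5 + math.sqrt(2.0 / math.pi)
kl_qp_closed = -0.5 * math.log(2.0 / math.pi)

kl_pq_numeric = kl_numeric(log_p, log_q)
kl_qp_numeric = kl_numeric(log_q, log_p)
```

Both approximations agree with the closed forms to several decimal places, and the Laplace-to-normal value is roughly three times the normal-to-Laplace value.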

Another advantage to Kullback-Leibler information is that no smooth- 
ness conditions on the densities (like the FI regularity conditions) are 
needed. 






Example 2.96. Suppose that $P_\theta$ says that $X$ has a uniform $U(0, \theta)$ distribution. 
For $\delta > 0$, 

$$\mathcal{I}_X(\theta;\theta+\delta) = \int_0^\theta \log\left(\frac{\theta+\delta}{\theta}\right)\frac{1}{\theta}\,dx = \log\left(1 + \frac{\delta}{\theta}\right),$$

$$\mathcal{I}_X(\theta+\delta;\theta) = \infty.$$

If an observation has a $U(0, \theta)$ distribution, there is some information to distinguish 
the observation from one with a $U(0, \theta+\delta)$ distribution. On the other hand, 
if an observation has a $U(0, \theta+\delta)$ distribution, then there is infinite information 
to distinguish the observation from one with a $U(0, \theta)$ distribution. The reason for 
this is that there is positive probability that the $U(0, \theta)$ distribution can be ruled 
out entirely. In a sense, this is the most powerful kind of information possible for 
distinguishing distributions. 

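There is at least one connection between Kullback-Leibler information 
and Fisher information when they both exist and when two derivatives can 
be passed under the integral sign. In this case, 

$$\begin{aligned}
\frac{\partial^2}{\partial\theta_i\partial\theta_j}\mathcal{I}_X(\theta_0;\theta)\bigg|_{\theta=\theta_0}
&= -\frac{\partial^2}{\partial\theta_i\partial\theta_j}\int \log f_{X|\Theta}(x|\theta)\,f_{X|\Theta}(x|\theta_0)\,d\nu(x)\bigg|_{\theta=\theta_0} \\
&= -E_{\theta_0}\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f_{X|\Theta}(X|\theta)\bigg|_{\theta=\theta_0}\right),
\end{aligned}$$

the $(i, j)$ element of the Fisher information matrix. 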

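Example 2.97 (Continuation of Example 2.91; see page 116). The second partial 
derivative of the Kullback-Leibler information is 

$$\frac{\partial^2}{\partial\psi^2}\mathcal{I}_X(\theta;\psi) = \frac{\theta}{\psi^2} + \frac{1-\theta}{(1-\psi)^2}.$$

If one plugs in $\psi = \theta$, one gets $1/[\theta(1-\theta)] = \mathcal{I}_X(\theta)$, the Fisher information. 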
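
This connection can be checked with a finite difference. The sketch below, with an illustrative step size and parameter value, approximates the second derivative in $\psi$ of the Bernoulli Kullback-Leibler information at $\psi = \theta$ and compares it with $1/[\theta(1-\theta)]$.

```python
import math


def kl_bernoulli(theta, psi):
    """Kullback-Leibler information between Ber(theta) and Ber(psi)."""
    return (theta * math.log(theta / psi)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - psi)))


def second_derivative_in_psi(theta, psi, step=1e-4):
    """Central second difference of psi -> kl_bernoulli(theta, psi)."""
    return (kl_bernoulli(theta, psi + step)
            - 2.0 * kl_bernoulli(theta, psi)
            + kl_bernoulli(theta, psi - step)) / step ** 2


theta = 0.3
approx_information = second_derivative_in_psi(theta, theta)
fisher_information = 1.0 / (theta * (1.0 - theta))
```

The finite-difference value matches the Fisher information up to the discretization error of the difference quotient.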



2.3.3 Conditional Information* 

We defined conditional Fisher information in Definition 2.79 as the condi- 
tional covariance matrix of the conditional score function. We also defined 
conditional Kullback-Leibler information in Definition 2.89 as the condi- 
tional mean of the logarithm of the ratio of the conditional densities. We 
used these conditional information measures in the proofs of Theorems 2.86 
and 2.93 to show that sufficient statistics contain all of the information in a 



*This section may be skipped without interrupting the flow of ideas. 






sample. However, in Section 2.1.4, it was suggested that performing infer- 
ence conditional on ancillary statistics makes better use of the information 
available in the actual sample obtained. We can make this idea more precise 
by considering conditional Fisher and Kullback-Leibler information given 
an ancillary. 

Theorem 2.98. Let $U$ be an ancillary statistic. Both Fisher and Kullback-Leibler 
information have the property that the information is the mean of 
the conditional information given $U$. 

Proof. Suppose that $X$ has a density $f_{X|\Theta}$ given $\Theta$ with respect to a 
measure $\nu$. If $u = U(x)$, then we can write $f_{X|\Theta}(x|\theta) = f_U(u)f_{X|U,\Theta}(x|u,\theta)$, 
since $U$ is independent of $\Theta$. If the FI regularity conditions hold, then 

$$\frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(x|\theta) = \frac{\partial}{\partial\theta_i}\log f_{X|U,\Theta}(x|u,\theta).$$

Since the mean of the conditional score function is 0, a.s., the mean of 
the conditional covariance matrix equals the marginal covariance matrix 
by Proposition B.78. In symbols, $\mathcal{I}_X(\theta) = E_\theta\,\mathcal{I}_{X|U}(\theta|U)$. Similarly, for 
Kullback-Leibler information, 

$$\frac{f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\psi)} = \frac{f_{X|U,\Theta}(x|u,\theta)}{f_{X|U,\Theta}(x|u,\psi)},$$

so that $\mathcal{I}_X(\theta;\psi) = E\,\mathcal{I}_{X|U}(\theta;\psi|U)$. □ 

Some data sets have more information and some have less depending 
on the value of the ancillary $U$. Theorem 2.98 says that the amount of 
information averages out to the marginal information over the distribution 
of $U$, but we can make use of the observed value of $U$ to tell us whether 
we have one of the data sets with more or less information. 

Example 2.99 (Continuation of Example 2.38; see page 95). In this example, 
$X = (X_1, X_2)$ with the $X_i$ being IID $N(\theta, 1)$ given $\Theta = \theta$. We had $U = X_2 - X_1$. 
The conditional distribution of $X$ given $U$ can be obtained from the conditional 
distribution of $X_1$ given $U = u$, which is $N(\theta - u/2, 1/2)$. The conditional score function 
is $2(X_1 - \theta + u/2)$, which has conditional variance equal to 2 for all $u$. Similarly, 
the conditional Kullback-Leibler information is $\mathcal{I}_{X|U}(\theta;\psi|u) = (\theta-\psi)^2$ for 
all $u$. Hence, this ancillary does not help distinguish data sets from each other. 

Example 2.100 (Continuation of Example 2.52; see page 100). In this problem, 
there were two ancillaries, $M_0$ and $N_0$. We can write the second derivative of the 
logarithm of the density as 

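$$\frac{\partial^2}{\partial\theta^2}\log f_{X|\Theta}(X|\theta) = -\frac{N_{00}}{(1-\theta)^2} - \frac{N_{01}}{(1+\theta)^2} - \frac{N_{10}}{(2+\theta)^2} - \frac{N_{11}}{(2-\theta)^2}. \qquad (2.101)$$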

The Fisher information is $\mathcal{I}_X(\theta) = N(2 - \theta^2)/[(1 - \theta^2)(4 - \theta^2)]$. According to 
Problem 43 on page 143, we can find the conditional Fisher information by calculating 
minus the conditional mean of (2.101). Conditional on $M_0 = m_0$, we get 

$$\mathcal{I}_{X|M_0}(\theta|m_0) = \frac{3m_0 + N(1-\theta^2)}{(1-\theta^2)(4-\theta^2)}.$$

It is clear that this is an increasing function of $m_0$, so the more observations we 
get with first coordinate equal to 0, the more conditional information we have 
given $M_0$. Conditional on $N_0 = n_0$, we get 

$$\mathcal{I}_{X|N_0}(\theta|n_0) = \frac{2n_0\theta + N(2-\theta) - N\theta^2}{(1-\theta^2)(4-\theta^2)}.$$

This is an increasing function of $n_0$. It is easy to verify that the means of these two 
are both equal to the marginal information, since $E(M_0) = N/3$ and $E(N_0) = N/2$. 
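
The averaging property in Example 2.100 can be verified numerically. The sketch below is not from the text: it assumes $M_0 \sim Bin(N, 1/3)$ and $N_0 \sim Bin(N, 1/2)$ (consistent with the variances quoted in Example 2.102), writes the numerator of the conditional information given $N_0 = n_0$ as $2n_0\theta + N(2-\theta) - N\theta^2$ (the form for which the averaging identity holds), and averages each conditional information over the corresponding binomial distribution. Function names and the values $N = 50$, $\theta = 0.1$ are illustrative.

```python
from math import comb


def binomial_mean(func, n_total, prob):
    """E[func(K)] for K ~ Bin(n_total, prob), by direct summation."""
    return sum(comb(n_total, k) * prob ** k * (1.0 - prob) ** (n_total - k) * func(k)
               for k in range(n_total + 1))


def marginal_information(n_total, theta):
    return n_total * (2.0 - theta ** 2) / ((1.0 - theta ** 2) * (4.0 - theta ** 2))


def information_given_m0(m0, n_total, theta):
    return (3.0 * m0 + n_total * (1.0 - theta ** 2)) / ((1.0 - theta ** 2) * (4.0 - theta ** 2))


def information_given_n0(n0, n_total, theta):
    # Assumed numerator: 2 n0 theta + N (2 - theta) - N theta^2.
    numerator = 2.0 * n0 * theta + n_total * (2.0 - theta) - n_total * theta ** 2
    return numerator / ((1.0 - theta ** 2) * (4.0 - theta ** 2))


N_TOTAL, THETA = 50, 0.1
marginal = marginal_information(N_TOTAL, THETA)
mean_given_m0 = binomial_mean(lambda k: information_given_m0(k, N_TOTAL, THETA), N_TOTAL, 1.0 / 3.0)
mean_given_n0 = binomial_mean(lambda k: information_given_n0(k, N_TOTAL, THETA), N_TOTAL, 0.5)
```

Both averages reproduce the marginal Fisher information, as Theorem 2.98 requires.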

In Example 2.100, one might ask which of the ancillaries does a better 
job of distinguishing data sets from each other in terms of information. 
This might be answered by looking at how spread out is the distribution 
of the conditional information. 

Example 2.102 (Continuation of Example 2.100; see page 119). We can compute 
$\mathrm{Var}(M_0) = 2N/9$ and $\mathrm{Var}(N_0) = N/4$, so that the variance of $\mathcal{I}_{X|M_0}(\theta|M_0)$ 
is $2/\theta^2$ ($> 1$) times as large as the variance of $\mathcal{I}_{X|N_0}(\theta|N_0)$. Suppose that we are 
interested in the statistic $N_{00}$. We can calculate any aspect we wish of the conditional 
distribution of $N_{00}$ given either $M_0$ or $N_0$ (and $\Theta$). To see how much more 
$M_0$ distinguishes data sets than does $N_0$, Figure 2.103 shows the distribution of 
the conditional mean of $N_{00}$ given $M_0$ and $N_0$ for $\theta = 0.1$ and $N = 50$. It is easy 
to see how much more the distribution is spread out conditional on $M_0$ than on 
$N_0$. Since the variance of the conditional mean of $N_{00}$ is greater given $M_0$ than 
given $N_0$ (2.25 versus 1.125), it follows from Proposition B.78 that the mean of 
the conditional variance must be smaller (by the same amount) given 




FIGURE 2.103. Distribution of Conditional Means of $N_{00}$ given $M_0$ and $N_0$ 






$M_0$ than given $N_0$. In fact, the values are 4.125 and 5.25, respectively. Because 
this example is sufficiently simple, one can even calculate the probability that 
the conditional variance of $N_{00}$ given $M_0$ will be smaller than the conditional 
variance given $N_0$. The probability is 0.8346. 
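
The four numbers quoted in Example 2.102 can be reproduced under a simple model. The sketch below rests on assumptions that are not stated in this section: that $M_0 \sim Bin(N, 1/3)$ and $N_0 \sim Bin(N, 1/2)$, and that $N_{00}$ is conditionally binomial with success probability $(1-\theta)/2$ given $M_0$ and $(1-\theta)/3$ given $N_0$. These conditional probabilities are inferred so as to match the quoted values; they should be checked against Example 2.52.

```python
def variance_decomposition(n_total, ancillary_prob, conditional_prob):
    """Return (variance of conditional mean, mean of conditional variance).

    Assumes the ancillary count A ~ Bin(n_total, ancillary_prob) and the
    statistic of interest is Bin(A, conditional_prob) given A.
    """
    mean_ancillary = n_total * ancillary_prob
    var_ancillary = n_total * ancillary_prob * (1.0 - ancillary_prob)
    var_of_cond_mean = conditional_prob ** 2 * var_ancillary
    mean_of_cond_var = conditional_prob * (1.0 - conditional_prob) * mean_ancillary
    return var_of_cond_mean, mean_of_cond_var


N_TOTAL, THETA = 50, 0.1

# Assumed conditional success probabilities for N00 (see the lead-in).
given_m0 = variance_decomposition(N_TOTAL, 1.0 / 3.0, (1.0 - THETA) / 2.0)
given_n0 = variance_decomposition(N_TOTAL, 1.0 / 2.0, (1.0 - THETA) / 3.0)

total_variance_m0 = sum(given_m0)
total_variance_n0 = sum(given_n0)
```

Under these assumptions the two components sum to the same total variance for either ancillary, which is the content of Proposition B.78 as it is used in the example.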



2.3.4 Jeffreys' Prior* 

Fisher information turns out to have a role to play in one popular method 
for choosing prior distributions. Suppose that one desires a method for 
choosing a prior density with the following property. If the parameter $\Theta$ 
were to be transformed by a one-to-one differentiable function $g$ with differentiable 
inverse, then the prior for $\Psi = g(\Theta)$ obtained by the method would 
be the same as the usual transformation of the prior obtained for $\Theta$ by the 
method. For example, suppose that $\Theta$ is a positive parameter and that 
the method produces a prior with density $f_\Theta(\cdot)$ with respect to Lebesgue 
measure. Let $\Psi = \Theta^2$. The usual method of transformations would make 
the prior for $\Psi$ equal to 

$$f_\Psi(\psi) = f_\Theta(\sqrt{\psi})\,\frac{1}{2\sqrt{\psi}}.$$

We want the method, when applied directly to $\Psi$, to produce this same 
prior. 

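A class of methods that have this property is the following. Let 
$h : \mathcal{X} \times \Omega \to \mathbb{R}$ be a function, and define 

$$f_\Theta(\theta) = \sqrt{\mathrm{Var}_\theta\,\frac{\partial}{\partial\theta}h(X,\theta)}. \qquad (2.104)$$

To see that this works, let $\Psi = g(\Theta)$ and let the new parameter space be 
$\Omega'$. Note that $h$ must be modified to $h' : \mathcal{X} \times \Omega' \to \mathbb{R}$ by 

$$h'(x,\psi) = h(x, g^{-1}(\psi)),$$

or else the expression on the right of (2.104) makes no sense. Now 

$$\frac{\partial}{\partial\psi}h'(x,\psi) = \frac{\partial}{\partial\psi}h(x, g^{-1}(\psi)) = \frac{\partial}{\partial\theta}h(x,\theta)\bigg|_{\theta=g^{-1}(\psi)}\frac{d}{d\psi}g^{-1}(\psi),$$

$$f_\Psi(\psi) = \sqrt{\mathrm{Var}_\psi\,\frac{\partial}{\partial\psi}h'(X,\psi)} = f_\Theta(g^{-1}(\psi))\left|\frac{d}{d\psi}g^{-1}(\psi)\right|,$$

which is just what transformation of variables would give. 

*This section may be skipped without interrupting the flow of ideas. 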
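
The invariance argument can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes $h$ to be the log density of a single $Ber(\theta)$ observation and the transformation $\psi = \theta^2$, computes the square root of the score variance directly in the $\psi$ parameterization (with the score approximated by a central difference), and compares it with the transformed density $f_\Theta(\sqrt{\psi})/(2\sqrt{\psi})$. All function names are hypothetical.

```python
import math


def sqrt_score_variance(log_density, support, param, step=1e-6):
    """sqrt(Var_param of d/dparam log_density(X, param)) on a finite support."""
    mean = 0.0
    second_moment = 0.0
    for x in support:
        prob = math.exp(log_density(x, param))
        score = (log_density(x, param + step) - log_density(x, param - step)) / (2.0 * step)
        mean += prob * score
        second_moment += prob * score ** 2
    return math.sqrt(second_moment - mean ** 2)


def log_f_theta(x, theta):
    """h(x, theta): log density of one Bernoulli(theta) observation."""
    return x * math.log(theta) + (1 - x) * math.log(1.0 - theta)


def log_f_psi(x, psi):
    """h'(x, psi) = h(x, g^{-1}(psi)) for the assumed transformation psi = theta^2."""
    return log_f_theta(x, math.sqrt(psi))


psi = 0.36
theta = math.sqrt(psi)

direct_in_psi = sqrt_score_variance(log_f_psi, (0, 1), psi)
transformed_from_theta = sqrt_score_variance(log_f_theta, (0, 1), theta) / (2.0 * math.sqrt(psi))
```

Applying the method directly to $\psi$ and transforming the answer obtained for $\theta$ give the same value, up to finite-difference error.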

The most popular function $h$ to use for such a method is the logarithm 
of the conditional density $h(x,\theta) = \log f_{X|\Theta}(x|\theta)$. In this case, 

$$\frac{\partial}{\partial\theta}h(x,\theta) = \frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta),$$

which is the score function. We have already seen that under the FI regularity 
conditions, $\mathrm{Var}_\theta\,\partial h(X,\theta)/\partial\theta = \mathcal{I}_X(\theta)$. So, the method says to use 
$f_\Theta(\theta) = c\sqrt{\mathcal{I}_X(\theta)}$ as the prior density, where $c$ is chosen to make the integral 
of $f_\Theta(\theta)$ equal to 1, if possible. If no such $c$ exists, then $f_\Theta(\theta) = \sqrt{\mathcal{I}_X(\theta)}$ 
is often used as an improper prior. This type of prior is called Jeffreys' prior 
after Harold Jeffreys, who proposed it in Jeffreys (1961, p. 181). 

Example 2.105. Suppose that $X \sim Bin(n,p)$ given $P = p$. Then we saw in 
Example 2.82 on page 112 that the Fisher information is $\mathcal{I}_X(p) = n(p[1-p])^{-1}$. 
This makes the Jeffreys' prior proportional to $p^{-1/2}(1-p)^{-1/2}$. This is the 
$Beta(1/2, 1/2)$ distribution, which is a proper prior. 

Example 2.106. Suppose that $X \sim Negbin(a,p)$ given $P = p$. It is not difficult 
to show that the Fisher information is $\mathcal{I}_X(p) = a(p^2[1-p])^{-1}$. This makes the 
Jeffreys' prior proportional to $p^{-1}(1-p)^{-1/2}$. This is not a proper prior. 

Interestingly also, Jeffreys' prior in this example is not the same as the prior 
for the case of binomial sampling (Example 2.105). This means that choosing a 
prior by Jeffreys' method has the unfortunate characteristic that it depends on 
something that would normally not be taken into account in a Bayesian analysis. 
For example, suppose that one were to be exposed to a sequence of exchangeable 
Bernoulli random variables one at a time. If one were asked to calculate the 
predictive distribution of each observation before it is observed, one would have 
to ask whether sampling was to continue to a fixed size or to a fixed number 
of successes (or failures) before one could even choose a prior distribution.$^{15}$ This 
stopping criterion should be irrelevant before the first observation arrives, but 
the method of Jeffreys' prior must take it into account. 
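
The two Fisher information formulas behind Examples 2.105 and 2.106 can be checked by direct summation. In the sketch below the negative binomial is taken to count failures before the $a$th success, with mass function $\binom{a+x-1}{x}p^a(1-p)^x$; this convention is an assumption, since the text does not restate it here, and the infinite sum is truncated. The function names and the values $n = 10$, $a = 3$, $p = 0.3$ are illustrative.

```python
import math


def fisher_binomial(n, p):
    """E[-d^2/dp^2 log f(X|p)] for X ~ Bin(n, p), by summation over x."""
    total = 0.0
    for x in range(n + 1):
        pmf = math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
        total += pmf * (x / p ** 2 + (n - x) / (1.0 - p) ** 2)
    return total


def fisher_negative_binomial(a, p, terms=2000):
    """Same expectation for the assumed Negbin(a, p) failure-count model (truncated sum)."""
    total = 0.0
    for x in range(terms):
        pmf = math.comb(a + x - 1, x) * p ** a * (1.0 - p) ** x
        total += pmf * (a / p ** 2 + x / (1.0 - p) ** 2)
    return total


p = 0.3
info_binomial = fisher_binomial(10, p)
info_negative_binomial = fisher_negative_binomial(3, p)

# Unnormalized Jeffreys' priors: square roots of the two informations.
jeffreys_binomial = math.sqrt(info_binomial)
jeffreys_negative_binomial = math.sqrt(info_negative_binomial)
```

As a function of $p$, the ratio of the two square roots is proportional to $p^{-1/2}$, which is why the two sampling schemes lead to different Jeffreys' priors.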

Example 2.107. Suppose that the density with respect to Lebesgue measure of 
$X$ given $\Theta = \theta$ is $f(x - \theta)$, where $f$ is differentiable. Then 

$$\frac{\partial}{\partial\theta}\log f(X-\theta) = -\frac{f'(X-\theta)}{f(X-\theta)}.$$



15 One could imagine situations in which the stopping criterion is chosen by 
someone who has information not available to us. In such a situation, it is possible 
that when we learn what this other person has chosen for the stopping criterion, 
we believe that the choice tells us some additional information that we would 
like to incorporate into our model. For example, suppose that this other person 
decides to stop as soon as five successes are observed. We might then say, "Aha! 
We will use the prior $p^{-1}(1-p)^{-1/2}$ to reflect this information." But then, we 
discover that we only have time to collect four observations. Jeffreys' rule says 
that we have to change back to the prior $p^{-1/2}(1-p)^{-1/2}$ because we will have 
a fixed sample size, even if we believe that the reason the sample size is being 
fixed has nothing to do with $P$. 






The distribution of this quantity given $\Theta = \theta$ is the same as the distribution of 
$-f'(X)/f(X)$ given $\Theta = 0$, hence the variance is constant as a function of $\theta$. This 
means that Jeffreys' prior would be constant. If $\Omega$ is an unbounded set, Jeffreys' 
prior is improper. 

For multiparameter problems, a similar derivation is possible. Let $f_\Theta(\theta)$ 
be proportional to the square root of the determinant of the covariance 
matrix of the gradient vector of $h(X,\theta)$ with respect to $\theta$. It is easy to 
check that the gradient of $h'(x,\psi)$ with respect to $\psi$ is equal to the 
matrix whose determinant is the Jacobian times the gradient of $h(x,\theta)$ 
with respect to $\theta$ evaluated at $\theta = g^{-1}(\psi)$. In the special case of Jeffreys' 
prior, $h$ is the log of the density and $f_\Theta(\theta)$ becomes the square root of the 
determinant of the Fisher information matrix, $\mathcal{I}_X(\theta)$. 

Example 2.108. In Example 2.83 on page 112, we found that if $X \sim N(\mu,\sigma^2)$ 
given $\Theta = (\mu,\sigma)$, the Fisher information matrix was diagonal with entries $1/\sigma^2$ 
and $2/\sigma^2$. The determinant is $2/\sigma^4$ and Jeffreys' prior is a constant over $\sigma^2$, an 
improper prior. The usual improper prior in this problem is a constant over $\sigma$. 

One interesting feature of Jeffreys' prior is that its definition did not 
depend on the parameter space (except that we were able to take derivatives). 
That is, if the parameter space is actually an open subset of the 
set $\{\theta : f_{X|\Theta}(x|\theta) \text{ is a density}\}$, then Jeffreys' prior has the same form. 
Obviously, a different normalizing constant will be required if the prior is 
proper. 

Example 2.109 (Continuation of Example 2.107; see page 122). Suppose that 
the parameter space is actually only the open interval $(a, b)$, but the conditional 
density of $X$ given $\Theta = \theta$ is still $f(x - \theta)$. Then Jeffreys' prior is the $U(a, b)$ 
distribution, which is proper. 



2.4 Extremal Families* 

In this chapter we have shown how one can determine a sufficient statistic 
once one has chosen a family of models indexed by a parameter. Lauritzen 
(1984, 1988) has developed a theory in which a family of probability 
models is determined once one chooses a sequence of sufficient statistics. 
Lauritzen's theory is general enough to apply to collections of random 
quantities that are not exchangeable and to collections more general than 
sequences. We will consider only the case of sequences here. Diaconis and 
Freedman (1984) also prove results of a similar nature, and Theorem 2.111 
below is based on their work. 



*This section may be skipped without interrupting the flow of ideas. 






2.4.1 The Main Results 

Obviously, it takes more than just a sufficient statistic to identify an interesting 
class of probability models. For example, $T_n = \sum_{i=1}^n X_i$ is sufficient 
whether the $X_i$ are conditionally IID $N(\theta, 1)$ or $Ber(\theta)$ given $\Theta = \theta$. But 
these two models would not both be considered appropriate for the same 
data. The conditional distribution of $X_1, \ldots, X_n$ given $T_n$ would also be 
useful in identifying a class of models. This conditional distribution can be 
described by a transition kernel. 

Definition 2.110. Let $\mathcal{X}$ and $\mathcal{T}$ be topological spaces and let $\mathcal{B}$ and $\mathcal{C}$ 
be the Borel $\sigma$-fields. A function $r : \mathcal{B} \times \mathcal{T} \to [0, 1]$ is called a transition 
kernel if $r(\cdot,t)$ is a probability on $(\mathcal{X},\mathcal{B})$ for every $t \in \mathcal{T}$, and $r(A,\cdot)$ is 
measurable for every $A \in \mathcal{B}$. 

A transition kernel $r$ is like a regular conditional distribution, except 
that it need not satisfy an equation like $\int_C r(B,t)\,d\mu_T(t) = \Pr(X \in B, T \in C)$, 
because there is no mention of a marginal distribution for $T$. Our 
goal, in this section, is to prove a representation theorem for the joint 
distribution of $\{X_n\}_{n=1}^\infty$ under the assumptions that a particular sequence 
of statistics $\{T_n\}_{n=1}^\infty$ are sufficient and that the conditional distributions 
given the sufficient statistics are a particular collection of transition kernels. 

The basic structure we will consider here is the following. Let $(S, \mathcal{A})$ be 
a measurable space. For each $n$, let $(\mathcal{X}_n, \mathcal{B}_n)$ and $(\mathcal{T}_n, \mathcal{C}_n)$ be Borel spaces. 
The set $\mathcal{X}_n$ is the space in which all data available at time $n$ lie, and $\mathcal{T}_n$ is 
the space in which the sufficient statistic at time $n$ lies. Let $T_n : \mathcal{X}_n \to \mathcal{T}_n$ be 
measurable, and let $p_{n-1,n} : \mathcal{X}_n \to \mathcal{X}_{n-1}$ be onto and measurable. Then $T_n$ 
is the sufficient statistic at time $n$, and $p_{n-1,n}$ is the function that extracts 
the data available at time $n - 1$ from the data available at time $n$. Let $\mathcal{X}$ 
be the following subset of $\prod_{n=1}^\infty \mathcal{X}_n$: 

$$\mathcal{X} = \left\{x = (x_1, x_2, \ldots) \in \prod_{n=1}^\infty \mathcal{X}_n : p_{n-1,n}(x_n) = x_{n-1}, \text{ all } n > 1\right\}.$$

(It is easy to see that $\mathcal{X}$ is in the product $\sigma$-field.) The set $\mathcal{X}$ is the set of 
sequences of possible data which are consistent in the sense that the data 
at time $n$ are an extension of the data at time $n - 1$ for all $n > 1$. Define, 
for $k < n$, 

$$p_{k,n} = p_{k,k+1}(p_{k+1,k+2}(\cdots(p_{n-1,n}(\cdot)))).$$

Let $\mathcal{B}$ be the Borel $\sigma$-field of $\mathcal{X}$. Let $p_n : \mathcal{X} \to \mathcal{X}_n$ be the $n$th coordinate 
projection function $p_n(x_1, x_2, \ldots) = x_n$. Let $X : S \to \mathcal{X}$ be measurable, 
and define $X_n = p_n(X)$. The definition of $\mathcal{X}$ makes it clear that, for all 
$x \in \mathcal{X}$, all $n$, and all $k < n$, $p_{k,n}(p_n(x)) = p_k(x)$. That is, $X_k = p_{k,n}(X_n)$ for 
all $n$ and all $k < n$. Let $\Sigma_n$ be the sub-$\sigma$-field of $\mathcal{B}$ generated by $\{T_i(p_i)\}_{i=n}^\infty$, 
and set 

$$\Sigma = \bigcap_{n=1}^\infty \Sigma_n.$$

For brevity, we will use the symbol $T_n$ to stand for $T_n(p_n(X))$ or for $T_n(p_n)$. 
This makes $\Sigma$ the tail $\sigma$-field of the sequence of statistics $\{T_n\}_{n=1}^\infty$. 

Theorem 2.111. For each $n$, let $r_n : \mathcal{B}_n \times \mathcal{T}_n \to [0,1]$ be a transition 
kernel such that 

$$r_n(T_n^{-1}(\{t\}), t) = 1, \quad \text{for all } t \in \mathcal{T}_n. \qquad (2.112)$$

Suppose that the following is true for each $n$ and $t \in \mathcal{T}_{n+1}$:$^{16}$ 

Condition S: Assume that the distribution of $X_{n+1}$ is $r_{n+1}(\cdot,t)$. 
Then $r_n(\cdot,s)$ is a regular conditional distribution for $X_n$ given 
$T_n = s$ for all $s \in T_n(p_{n,n+1}(T_{n+1}^{-1}(\{t\})))$. 

Let $\mathcal{M}$ be the set of all distributions on $(\mathcal{X}, \mathcal{B})$ such that $r_n$ is a version of 
the conditional distribution of $X_n$ given $T_n$ for all $n$. Then $\mathcal{M}$ is a convex 
set. Let $\mathcal{E}$ be the extreme points of $\mathcal{M}$. Suppose that $\mathcal{M}$ is nonempty. Then, 
there exists a set $E \in \Sigma$ and a transition kernel $Q : \mathcal{B} \times E \to [0, 1]$ such 
that 

1. $P(E) = 1$ for all $P \in \mathcal{M}$, 

2. for each $x \in E$, $r_n(\cdot, T_n(x_n)) \stackrel{\mathcal{D}}{\to} Q(\cdot, x)$, 

3. for each $P \in \mathcal{M}$, $Q$ is a regular conditional distribution for $X$ given 
$\Sigma$, 

4. for each $x \in E$ and $A \in \Sigma$, $Q(A,x) \in \{0, 1\}$, 

5. for each $P \in \mathcal{M}$, there is a unique probability $R$ on $(\mathcal{X}, \Sigma)$ such that 

$$P = \int_E Q(\cdot,x)\,dR(x),$$

6. the $R$ in part 5 is the restriction of $P$ to $\Sigma$, 

7. for each $P \in \mathcal{M}$, $P \in \mathcal{E}$ if and only if $P(\{x : P = Q(\cdot,x)\}) = 1$. 

If the distribution of $X$ is in the class $\mathcal{M}$, we say that $\{X_n\}_{n=1}^\infty$ is partially 
exchangeable$^{17}$ relative to the sequences $\{T_n\}_{n=1}^\infty$ and $\{r_n\}_{n=1}^\infty$. The family 
of distributions in $\mathcal{E}$ is called the extremal family. It is helpful to comment 
on the conditions in Theorem 2.111. Equation (2.112) says that $r_n(\cdot, t)$ puts 



16 Condition S is a way of saying that $X_n$ is conditionally independent of $T_{n+1}$ 
given $T_n$. The problem with saying it this way is that such a statement requires an 
explicit distribution for $X_{n+1}$, and such a distribution has not yet been defined. 

17 Dawid (1982) refers to the type of models considered here as intersubjective 
models. The reason is that all of the distributions in $\mathcal{M}$ have a common conditional 
distribution for the data $X_n$ given $T_n$, and they only disagree on the 
marginal distribution of $T_n$. 






all of its mass on the set of points where $T_n = t$, so that it really looks like 
a conditional distribution given $T_n = t$. Condition S is a way of expressing 
the idea that $T_n$ is sufficient without introducing parameters. It says that 
conditioning $X_{n+1}$ on $T_{n+1}$ and then looking at the conditional distribution 
of $X_n$ given $T_n$ is the same as conditioning $X_n$ on $T_n$ from the start. 

The most common situation in which the above conditions hold is that 
in which $\mathcal{X}_n = \mathcal{Y}^n$ and $\mathcal{B}_n = \mathcal{D}^n$ for some Borel space $(\mathcal{Y},\mathcal{D})$. In this case 
$p_{n,n+1}(y_1, \ldots, y_{n+1}) = (y_1, \ldots, y_n)$, and there is a bimeasurable function 
$w : \mathcal{X} \to \mathcal{Y}^\infty$ given by $w(y_1, (y_1, y_2), \ldots) = (y_1, y_2, \ldots)$. If, in addition, the 
$Y_i$ are exchangeable, we have a special version of Theorem 2.111. 

Theorem 2.113. Let $(\mathcal{Y},\mathcal{D})$ be a Borel space and let $\mathcal{X}_n = \mathcal{Y}^n$, $\mathcal{B}_n = \mathcal{D}^n$. 
Let $Y_k : S \to \mathcal{Y}$ be measurable for all $k$. Define $\mathcal{X}$ as above and define 
$X : S \to \mathcal{X}$ by$^{18}$ 

$$X = (Y_1, (Y_1,Y_2), (Y_1,Y_2,Y_3), \ldots).$$

Let $w : \mathcal{X} \to \mathcal{Y}^\infty$ be defined by 

$$w(y_1, (y_1,y_2), (y_1,y_2,y_3), \ldots) = (y_1, y_2, y_3, \ldots).$$

Assume the conditions of Theorem 2.111, but replace condition S by the 
stronger 

Condition T: Assume that the distribution of $X_{n+1}$ is $r_{n+1}(\cdot,t)$. 
Then $r_n(\cdot,s)$ is a regular conditional distribution for $X_n$ given 
$Y_{n+1}$ and $T_n = s$ for all $s \in T_n(p_{n,n+1}(T_{n+1}^{-1}(\{t\})))$. 

Suppose that for every $n$ and $t$, $r_n(\cdot,t)$ is the distribution of $n$ exchangeable 
coordinates, and that $T_n$ is a symmetric (with respect to permutations of 
the arguments) function of $Y_1, \ldots, Y_n$. Let $\mathcal{M}^*$ and $\mathcal{E}^*$ be the sets of distributions 
on $(\mathcal{Y}^\infty,\mathcal{D}^\infty)$ induced by $w$ from those in $\mathcal{M}$ and $\mathcal{E}$, respectively. 
Then all elements of $\mathcal{M}^*$ are distributions of exchangeable random quantities. 
Also, $\mathcal{E}^*$ is the set of all IID distributions in $\mathcal{M}^*$ for which the coordinates 
have distribution equal to limits (as $n \to \infty$) of $r_n(p_{1,n}^{-1}(\cdot), T_n(x_n))$ 
for $(x_1, x_2, \ldots) \in E$. 

Exponential families are a special case which can be characterized by 
their transition kernels. A generalization of the following theorem was 
proven by Diaconis and Freedman (1990). What this theorem says is that 
if an extremal family has the same sufficient statistics and conditional dis- 
tributions as an exponential family, then those members of the extremal 
family with densities are the exponential family distributions. There may 
also be degenerate distributions in the extremal family which would not be 
part of the exponential family. (See Example 2.117 on page 128.) 



18 In this way $X_n = (Y_1, \ldots, Y_n)$. 






Theorem 2.114. Let $h : \mathbb{R}^k \to [0,\infty)$ be strictly positive on a set of 
positive Lebesgue measure. Suppose that there exists $\theta$ such that 

$$c(\theta) = \int h(x)\exp(\theta^T x)\,dx < \infty. \qquad (2.115)$$

Let $\mathcal{Y} = \mathbb{R}^k = \mathcal{T}_n$ for all $n$. Suppose that $T_n(y_1, \ldots, y_n) = \sum_{i=1}^n y_i$. Let 
$h^{(1)} = h$ and 

$$h^{(n)}(t) = \int h^{(n-1)}(t-y)\,h(y)\,dy,$$

for $n > 1$. For $B \subseteq \mathcal{X}_n$ and $t \in \mathbb{R}^k$, let 

$$B(t) = \left\{(y_1, \ldots, y_{n-1}) : \left(y_1, \ldots, y_{n-1}, t - \sum_{i=1}^{n-1} y_i\right) \in B\right\}.$$

Let 

$$r_n(B,t) = \frac{1}{h^{(n)}(t)}\int_{B(t)} h\left(t - \sum_{i=1}^{n-1} y_i\right)\prod_{i=1}^{n-1} h(y_i)\,dy_1\cdots dy_{n-1}.$$

Then the conditions of Theorem 2.113 are satisfied. Also, the members of 
the extremal family with distributions absolutely continuous with respect to 
Lebesgue measure are the members of the exponential family with the $Y_i$ 
being IID with density $h(y)\exp(\theta^T y)/c(\theta)$, for some $\theta$ satisfying (2.115). 

2.4.2 Examples 

In this section, we present examples of the above theorems for exchangeable 
sequences. For more general distributions, some examples are given in 
Section 8.1.3. The theorems can be summarized as follows. Suppose that 
we specify a sequence of sufficient statistics and conditional distributions 
for $\{Y_n\}_{n=1}^\infty$ given the sufficient statistics in such a way that the sequence 
is exchangeable. Then the $X_n$ are conditionally IID with distribution being 
one of the limits of $r_n(p_{1,n}^{-1}(\cdot), T_n(x_n))$. Examination of these limits should 
reveal the collection of extremal distributions. 

The most straightforward example of Theorem 2.113 is to show that it 
implies DeFinetti's representation theorem 1.49 for random variables. 

Example 2.116. Let $\mathcal{X}_n = \mathbb{R}^n$. Let $\mathcal{T}_n$ be the subset of $\mathbb{R}^n$ with the coordinates 
in nondecreasing order. Let $r_n(A,t)$ equal $1/n!$ times the number of permutations 
of the coordinates of $t$ which are elements of $A$. Let $p_{n-1,n}(x_1, \ldots, x_n) = 
(x_1, \ldots, x_{n-1})$. Clearly, every IID distribution is in $\mathcal{M}$, so $\mathcal{M}$ is nonempty. Now, 
suppose that $X_{n+1}$ has distribution $r_{n+1}(\cdot,t)$. The set $Y(t) = p_{n,n+1}(T_{n+1}^{-1}(\{t\}))$ 
for $t \in \mathcal{T}_{n+1}$ is just the set of vectors of length $n$ whose coordinates are $n$ draws 
without replacement from the coordinates of $t$, or equivalently, the set of vectors 
consisting of the first $n$ coordinates of the permutations of the coordinates of $t$. 
The distribution of $X_n$ is uniform over these $(n+1)!$ points (with repeats counted 
as more than one point), and the distribution of $T_n$ is uniform on $T_n(Y(t))$, which 
consists of the $n+1$ vectors obtained by removing one coordinate from $t$. The conditional 
distribution of $X_n$ given $T_n = s$ is clearly $r_n(\cdot,s)$ for each $s \in T_n(Y(t))$. 
Hence the conditions of Theorem 2.111 are satisfied. 

A combinatorial argument like the one used in the proof of Theorem 1.49 shows 
that each limit point of a sequence $\{r_n(p_{k,n}^{-1}(\cdot), T_n(x_n))\}_{n=1}^\infty$ of probabilities (for 
fixed $k$ and $x$) has IID coordinates. Hence, $Q(\cdot, x)$ is a distribution of IID random 
variables for each $x$. We can determine which IID distribution by looking at the 
first coordinate. Since the $r_n(p_{1,n}^{-1}(\cdot), T_n(x_n))$ distributions are just the empirical 
probability measures of the first $n$ coordinates of $x$, $Q(\cdot,x)$ is the limit of the 
empirical probability measures for those $x$ such that the limit exists. Since all 
CDFs on $\mathbb{R}$ are such limits, we get DeFinetti's representation theorem 1.49 out 
of Theorem 2.111. 
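
The claim that the first-coordinate marginal of the permutation kernel is the empirical measure can be seen in a small case. The sketch below is illustrative: `r_n` enumerates all $n!$ orderings of the coordinates of $t$ (repeated values are counted once per position), and the event and the vector `t` are arbitrary choices.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def r_n(event, t):
    """1/n! times the number of orderings of the coordinates of t lying in the event."""
    count = sum(1 for ordering in permutations(t) if event(ordering))
    return Fraction(count, factorial(len(t)))


def empirical_measure(condition, values):
    """Fraction of the values that satisfy the condition."""
    return Fraction(sum(1 for v in values if condition(v)), len(values))


t = (1, 2, 2, 5)   # coordinates in nondecreasing order, with a repeated value

# Probability, under r_n(., t), that the first coordinate is at most 2.
kernel_probability = r_n(lambda ordering: ordering[0] <= 2, t)

# Empirical measure of the same set, computed from the coordinates of t.
empirical_probability = empirical_measure(lambda v: v <= 2, t)
```

The two fractions coincide, which is the finite-$n$ version of the statement used in the example.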

When $(\mathcal{T}_n, \mathcal{C}_n)$ is the same space $(\mathcal{T}, \mathcal{C})$ for all $n$, it may be possible to 
identify the extremal distributions with elements of $\mathcal{T}$. 

Example 2.117. Suppose that $\mathcal{X}_n = \mathbb{R}^n$ and $\mathcal{T}_n = \mathbb{R} \times \mathbb{R}^{+0}$ with $T_n(x_1, \ldots, x_n) = 
(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2)$, and $r_n(\cdot, (t_1, t_2))$ the uniform distribution on the surface of 
the sphere of radius $\sqrt{t_2 - t_1^2/n}$ around $(t_1, \ldots, t_1)/n$. If an $n$-dimensional vector 
$Y$ is uniformly distributed on the sphere of radius 1 around 0, then $r_n$ is the 
distribution of $t_1/n + \sqrt{t_2 - t_1^2/n}\,Y$. So, we will find the distribution of $Y_1$. The 
conditional distribution of $(Y_2, \ldots, Y_n)$ given $Y_1 = y_1$ is uniform on the sphere 
in the $(n-1)$-dimensional space in which the first coordinate is $y_1$ with radius 
$\sqrt{1 - y_1^2}$ around the point $(y_1, 0, \ldots, 0)$. The marginal density of $Y_1$ is then the 
ratio of the surface areas of these two spheres. The surface area of a sphere of radius 
$r$ in $n > 1$ dimensions is $2\pi^{n/2}r^{n-1}/\Gamma(n/2)$. So 

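$$f_{Y_1}(y_1) = \frac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma([n-1]/2)}(1 - y_1^2)^{(n-3)/2}, \quad \text{for } -1 < y_1 < 1.$$

Let $\sigma_n^2 = (t_2 - t_1^2/n)/n$. Then, the density of $X_1$ given $T_n = (t_1, t_2)$ is 

$$f_{X_1|T_n}(x|t_1,t_2) = \frac{\Gamma(n/2)}{\sqrt{n\pi}\,\sigma_n\,\Gamma([n-1]/2)}\left(1 - \frac{(x - t_1/n)^2}{n\sigma_n^2}\right)^{(n-3)/2}.$$

Since 

$$\lim_{n\to\infty}\frac{\Gamma(n/2)}{\sqrt{n/2}\,\Gamma([n-1]/2)} = 1,$$

we have that 

$$f_{X_1|T_n}(x|t_1,t_2) \sim (2\pi)^{-1/2}\sigma_n^{-1}\left(1 - \frac{(x - t_1/n)^2}{n\sigma_n^2}\right)^{(n-3)/2}.$$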



If $\sigma_n$ converges to $\sigma$ and $t_1/n$ converges to $\mu$, this function converges uniformly on 
compact sets to the $N(\mu, \sigma^2)$ density. If $\sigma_n$ goes to $\infty$, the limit is 0 and does not 
correspond to a probability distribution. If $\sigma_n$ converges to 0 and $t_1/n$ converges 
to $\mu$, the density goes to 0 uniformly outside of every open interval around $\mu$; 
hence the limit distribution is concentrated at $\mu$. If $\sigma_n$ converges to 0 but $t_1/n$ 
goes to $\pm\infty$, the limit is not a probability distribution. Hence the set $\mathcal{E}$ consists 
of all IID distributions in which the coordinates are either normally distributed 
or constant. 
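
The normal limit in Example 2.117 can be checked numerically in the standardized case $t_1/n = 0$, $\sigma_n = 1$. The sketch below uses the standard closed form for the density of one coordinate of a point uniformly distributed on the unit sphere in $\mathbb{R}^n$, namely $\Gamma(n/2)/[\sqrt{\pi}\,\Gamma([n-1]/2)]\,(1-y^2)^{(n-3)/2}$; it is stated here as a known fact rather than quoted from the text, and the function names are illustrative.

```python
import math


def first_coordinate_density(y, n):
    """Density at y of one coordinate of a uniform point on the unit sphere in R^n (n >= 3)."""
    if abs(y) >= 1.0:
        return 0.0
    log_const = (math.lgamma(n / 2.0) - 0.5 * math.log(math.pi)
                 - math.lgamma((n - 1) / 2.0))
    return math.exp(log_const + 0.5 * (n - 3) * math.log(1.0 - y * y))


def scaled_density(z, n):
    """Density at z of sqrt(n) times the first coordinate (sphere of radius sqrt(n))."""
    return first_coordinate_density(z / math.sqrt(n), n) / math.sqrt(n)


def standard_normal_density(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Largest discrepancy from the N(0, 1) density over a few points, for growing n.
points = (0.0, 0.5, 1.0, 2.0)
gap_small_n = max(abs(scaled_density(z, 10) - standard_normal_density(z)) for z in points)
gap_large_n = max(abs(scaled_density(z, 10_000) - standard_normal_density(z)) for z in points)
```

The discrepancy shrinks as $n$ grows, in line with the convergence to the normal density described in the example.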

Theorem 2.111 is actually so general that it applies to all joint distribu- 
tions. 

Example 2.118. Let $\{\mathcal{Y}_n\}_{n=1}^\infty$ be a sequence of arbitrary Borel spaces. Let 
$\mathcal{X}_n = \mathcal{T}_n = \prod_{i=1}^n \mathcal{Y}_i$ and $T_n$ be the identity transformation on $\mathcal{X}_n$. Let 
$r_n(A,t) = I_A(t)$. Let $p_{n-1,n}(y_1, \ldots, y_n) = (y_1, \ldots, y_{n-1})$. Then the conditions of 
Theorem 2.111 are satisfied, and the extreme points of $\mathcal{M}$ are the point mass 
distributions $Q(A, x) = I_A(x)$ for all $A \in \mathcal{B}$. The tail $\sigma$-field is the whole $\sigma$-field 
and the representation probability for $P$ is $P$ itself. Needless to say, this is not 
an interesting example of the representation, but it is an example. 

2.4.3 Proofs* 

The proof of Theorem 2.111 will proceed by means of a sequence of lemmas. 
The following simple proposition implies that M is a convex set. 

Proposition 2.119. Let $P$ and $Q$ be probability measures on a measurable 
space $(\mathcal{Y},\mathcal{C})$, and let $R = \lambda P + (1 - \lambda)Q$ with $0 < \lambda < 1$. Let $\mathcal{D}$ be a 
sub-$\sigma$-field of $\mathcal{C}$ such that $P(\cdot|\mathcal{D}) = Q(\cdot|\mathcal{D})$. Then $R(\cdot|\mathcal{D}) = P(\cdot|\mathcal{D})$. 

Next, we prove that the conditional distribution of $X_n$ given $\{T_i\}_{i=n}^\infty$ is the 
same as the conditional distribution given $T_n$ alone.$^{19}$ 

Lemma 2.120. For each $n$ and each $P \in \mathcal{M}$, $X_n$ is conditionally independent 
of $\{T_{n+i}\}_{i=1}^\infty$ given $T_n$. 

Proof. We will prove this by showing that the conditional distribution 
of $X_n$ given $T_n, \ldots, T_m$ is the same as the conditional distribution of $X_n$ 
given $T_n$ for all $m$. It will follow from the result in Problem 13 on page 663 
that this is also the conditional distribution of $X_n$ given $\{T_i\}_{i=n}^\infty$. 

For each $P \in \mathcal{M}$, $r_{n+1}(\cdot,t)$ is the conditional distribution of $X_{n+1}$ given 
$T_{n+1} = t$. Condition S says that $r_n(\cdot,s)$ is the conditional distribution of 
$X_n$ given $T_n = s$ and $T_{n+1} = t$. But $r_n$ is the conditional distribution of $X_n$ 
given $T_n$, so the result is true if $m = n+1$. We finish the proof by induction 
on $m$. Suppose that the conditional distribution of $X_n$ given $T_n, \ldots, T_m$ 
is $r_n$. Now, find the conditional distribution of $X_n$ given $T_n, \ldots, T_{m+1}$. 
The conditional distribution of $X_{m+1}$ given $T_{m+1}$ is $r_{m+1}$, so condition 
S says that the conditional distribution of $X_m$ given $(T_m, T_{m+1})$ is $r_m$, 



+ This section contains results that rely on the theory of martingales. It may 
be skipped without interrupting the flow of ideas. 

19 This is a stronger statement of the fact that T n is sufficient. If we think of 
{Te}gt n as the parameter, then Lemma 2.120 is the usual classical concept of 
sufficiency. 



130 Chapter 2. Sufficient Statistics 



the conditional distribution of X m given T m . 20 Since (X n , T n , . . . , T m ) is a 
function of X m , its conditional distribution given (T m , T m +i) is the same as 
its conditional distribution given T m . According to Theorem B.75, we can 
use this last conditional distribution to find the conditional distribution of 
X n given T n , . . . , T m +i by conditioning X n on T n , . . . , T m . By the induction 
hypothesis, this just produces r n , and the proof is complete. □ 

It follows from Lemma 2.120 that for every $n$, $r_n(\cdot,t)$ can be used as a
version of the conditional distribution of $X_n$ given $T_n = t, T_{n+1} = u, \ldots$.

Next, we find the conditional distributions of $X_k$ given $\Sigma$ for each $k$.
These distributions are limits of the conditional distributions given $T_n$ as
$n \to \infty$.

Lemma 2.121. For each $x \in \mathcal{X}$, and each $n$ and $k < n$, let $R_{k,n,x}$ be the
probability on $(\mathcal{X}_k,\mathcal{B}_k)$ induced by $p_{n,k}$ from $r_n(\cdot,T_n(p_n(x)))$.²¹ Define $L$
to be the set of all $x \in \mathcal{X}$ such that $R_{k,n,x}$ converges in distribution (as
$n \to \infty$) for all $k$ (denote the limit by $R_{k,x}$). Then $L \in \Sigma$, and $P(L) = 1$
for all $P \in \mathcal{M}$. Also, the function $f(x) = R_{k,x}(A)$ is measurable for all
$A \in \mathcal{B}_k$ and is a version of $P(p_k^{-1}(A)|\Sigma)$ for all $P \in \mathcal{M}$.

Proof. For each $k$, let $\phi_k : \mathcal{X}_k \to [0,1]$ be a bimeasurable function. (See
Definition B.31.) Let $Y_k = \phi_k(X_k)$. Let $Q_{k,n,x}$ be the probability induced
on $[0,1]$ from $R_{k,n,x}$ by $\phi_k$. Let $f : [0,1] \to \mathbb{R}$ be a bounded continuous
function. By Lemma 2.120,

$$E(f(Y_k)|T_n,T_{n+1},\ldots) = E(f(\phi_k(p_{n,k}(X_n)))|T_n,T_{n+1},\ldots) = E(f(Y_k)|T_n).$$

Now, define $g_{k,n}(x;f) = \int f(y)\,dQ_{k,n,x}(y)$. It follows that $E(f(Y_k)|T_n = t) = g_{k,n}(x;f)$
if $t = T_n(p_n(x))$. According to part II of Lévy's theorem B.124,
$E(f(Y_k)|T_n,T_{n+1},\ldots)$ converges almost surely to $E(f(Y_k)|\Sigma)$.
In terms of the points $x \in \mathcal{X}$, we say this as follows. First, we note that,
as in Lemma 2.120, a version of the conditional distribution of $X_n$ given
$\{T_\ell\}_{\ell=n}^\infty$ is $r_n$ for all $P \in \mathcal{M}$. Hence, versions of the conditional
distribution can be chosen so that the set of $x$ for which $g_{k,n}(x;f)$ converges does
not depend on $P$. Let $G_{k,f}$ be the set of $x$ for which $g_{k,n}(x;f)$ converges,
and call the limit $\lambda_k(x;f)$. (Hence, $\lambda_k$ does not depend on $P$ either.) Then
$G_{k,f} \in \Sigma$ and $P(G_{k,f}) = 1$ for all $P \in \mathcal{M}$. Also, $\lambda_k(\cdot;f)$ is measurable
(with respect to $\Sigma$), and it is a version of $E(f(Y_k)|\Sigma)$, that is, for all $A \in \Sigma$,

$$\int_A \lambda_k(x;f)\,dP(x) = \int_A f(\phi_k(p_k(x)))\,dP(x)$$

for all $P \in \mathcal{M}$.

²⁰ Note that we now have that $X_m$ is conditionally independent of $T_{m+1}$ given
$T_m$ because a marginal distribution of $X_{m+1}$ has been identified.
²¹ In symbols, $R_{k,n,x}(B) = r_n(p_{n,k}^{-1}(B),T_n(p_n(x))) = r_n(p_{n,k}^{-1}(B),T_n(x_n))$.




2.4. Extremal Families 131 



Let $C_0$ be a countable dense subset of the set of bounded continuous
functions from $[0,1]$ to $\mathbb{R}$ (see Lemma B.42) using the uniform metric on
functions. Let

$$G_k = \bigcap_{f \in C_0} G_{k,f},$$

so that $G_k \in \Sigma$ and $P(G_k) = 1$ for all $P \in \mathcal{M}$. Since the elements of $C_0$ are
dense in the uniform metric, we have that for every bounded continuous
$f : [0,1] \to \mathbb{R}$ and each $x \in G_k$, $\lim_{n\to\infty} g_{k,n}(x;f) = \lambda_k(x;f)$. This can
also be written as

$$\lim_{n\to\infty} \int f(y)\,dQ_{k,n,x}(y) = \int f(y)\,dQ_{k,x}(y),$$

where $Q_{k,x}$ is the conditional distribution of $Y_k$ given $\Sigma$ calculated from the
probability space $(\mathcal{X},\mathcal{B},P)$, which is the same for every $P \in \mathcal{M}$. In short,
$Q_{k,n,x} \to Q_{k,x}$ in distribution. Also, the function $g_k(x;f) = \int f(y)\,dQ_{k,x}(y)$ is measurable
with respect to $\Sigma$ and is a version of $E(f(Y_k)|\Sigma)$. Because $Q_{k,x}$ is a probability
on $[0,1]$ rather than on $\mathcal{X}_k$, we need to show that $Q_{k,x}(\phi_k(\mathcal{X}_k)) = 1$,
a.s. Since $g_k(x;f)$ is measurable even if $f$ is only bounded and measurable
(i.e., not necessarily continuous), it follows that $g_k(x;f)$ is also the conditional
mean of $f(Y_k)$ given $\Sigma$ for all bounded measurable $f$. Set $f = I_{\phi_k(\mathcal{X}_k)}$
to get

$$\int g_k(x;I_{\phi_k(\mathcal{X}_k)})\,dP(x) = \int I_{\mathcal{X}_k}(p_k(x))\,dP(x) = P(G_k \cap p_k^{-1}(\mathcal{X}_k)) = 1,$$

from which it follows that $Q_{k,x}(\phi_k(\mathcal{X}_k)) = 1$, a.s. To complete the proof,
set

$$H_k = \{x \in G_k : Q_{k,x}(\phi_k(\mathcal{X}_k)) = 1\}$$

and $L = \bigcap_{k=1}^\infty H_k$, and let $R_{k,x}$ be the probability induced on $(\mathcal{X}_k,\mathcal{B}_k)$ by
$\phi_k^{-1}$ from $Q_{k,x}$. □

The next lemma says that we can arrange for the $Q_{k,x}$ probabilities to
be a consistent set of distributions as $k$ varies.

Lemma 2.122. In the notation of Lemma 2.121, let

$$C = \{x \in L : R_{k,x}(p_{k-1,k}^{-1}(\cdot)) = R_{k-1,x}(\cdot), \text{ for all } k\}.$$

Then $C \in \Sigma$ and $P(C) = 1$ for all $P \in \mathcal{M}$.

Proof. As in the proof of Lemma 2.121, let $C_0$ be a countable dense set
of bounded continuous functions from $[0,1]$ to $\mathbb{R}$, and let

$$g_k(x;f) = \int f(y)\,dQ_{k,x}(y)$$

for each $k$ and each bounded measurable $f$. Then both $g_{k-1}(x;f)$ and
$g_k(x;f(p_{k-1,k}))$ are versions of $E(f(Y_{k-1})|\Sigma)$. Let $H_{k,f}$ be the set of $x \in L$
such that the two versions are equal, and let

$$C = \bigcap_{k=1}^\infty \bigcap_{f \in C_0} H_{k,f}.$$

Each $H_{k,f} \in \Sigma$ and $P(H_{k,f}) = 1$ for all $P \in \mathcal{M}$. Since $C_0$ is a dense set,
$x \in C$ implies $g_{k-1}(x;f) = g_k(x;f(p_{k-1,k}))$ for all bounded continuous $f$.
□

Next, we combine the consistent conditional distributions on the $\mathcal{X}_k$
spaces into a conditional distribution on the space $\mathcal{X}$ given $\Sigma$.

Lemma 2.123. There exists a transition kernel $Q : \mathcal{B} \times C \to [0,1]$ such
that $Q(A,x)$ is a version of $P(A|\Sigma)$ for all $A \in \mathcal{B}$ and all $P \in \mathcal{M}$.

Proof. Lemma 2.122 says that the finite-dimensional distributions $R_{k,x}$
are consistent for each fixed $x \in C$. Theorem B.133 says that for each
$x \in C$, there is a unique probability $Q(\cdot,x)$ on $(\mathcal{X},\mathcal{B})$ with $R_{k,x}$ as the
$k$-dimensional marginal for every $k$. But surely, for each $P \in \mathcal{M}$, $P(\cdot|\Sigma)$ is
such a probability, so $Q$ is a version of $P(\cdot|\Sigma)$ for every $P \in \mathcal{M}$. □

If $E \subseteq C$, then we have established parts 2 and 3 of Theorem 2.111
(see Lemma 2.128 below). Next we show that the probabilities $Q(\cdot,x)$ are
mostly in $\mathcal{M}$.

Lemma 2.124. Let $V = \{x \in C : Q(\cdot,x) \in \mathcal{M}\}$. Then $V \in \Sigma$ and
$P(V) = 1$ for all $P \in \mathcal{M}$.

Proof. A point $x \in V$ if and only if, for all $n$, all $A \in \mathcal{B}_n$, and all $B \in \mathcal{C}_n$,

$$\int_{p_n^{-1}(T_n^{-1}(B))} r_n(A,T_n(p_n(y)))\,dQ(y,x) = Q(\{T_n(X_n) \in B, X_n \in A\},x). \qquad (2.125)$$

Both sides of (2.125) are $\Sigma$ measurable functions of $x$, so the set of $x$ for
which (2.125) holds for fixed $n$, $A$, and $B$ is in $\Sigma$. Since $\mathcal{X}_n$ and $\mathcal{T}_n$ are Borel
spaces, there exist countable fields of sets which generate their respective
Borel $\sigma$-fields (Proposition B.43). The set of all $x$ such that (2.125) holds
for all $A$ and $B$ in those countable collections and all $n$ simultaneously is
therefore in $\Sigma$. But it is easy to see that if (2.125) holds for all $A$ in a field
that generates $\mathcal{B}_n$ (for fixed $B$ and fixed $n$), then it holds for all $A \in \mathcal{B}_n$,
and similarly for all $B$. Hence $V \in \Sigma$. To show that $P(V) = 1$ for all
$P \in \mathcal{M}$, let $P \in \mathcal{M}$ and let $G \in \Sigma$. Since $Q(\cdot,x)$ is a version of $P(\cdot|\Sigma)$, it
follows that the integral of the right-hand side of (2.125) over $G$ is

$$\int_G Q(\{T_n(X_n) \in B, X_n \in A\},x)\,dP(x) = P(G, T_n(X_n) \in B, X_n \in A).$$

Similarly, the integral of the left-hand side of (2.125) over $G$ is

$$\int_G \int_{p_n^{-1}(T_n^{-1}(B))} r_n(A,T_n(p_n(y)))\,dQ(y,x)\,dP(x)$$
$$= E(I_G I_B(T_n(X_n)) r_n(A,T_n(X_n))) = P(G, T_n(X_n) \in B, X_n \in A),$$

where the last equality follows from the fact that $G$ and $\{T_n(X_n) \in B\}$ are
both in $\Sigma_n$ and $r_n$ is a conditional distribution for $X_n$ given $\Sigma_n$. Now, let $G$
be the set of $x$ such that the left-hand side of (2.125) is strictly greater than
the right-hand side. If $P(G) > 0$, we have a contradiction, and similarly
if $G$ is the set of $x$ such that the left-hand side is strictly less than the
right-hand side. □
There may be many $x$ for which the $Q(\cdot,x)$ are the same, and it would be
useful not to distinguish these if we want the representation to be unique.

Lemma 2.126. Let $\Sigma'$ be the smallest $\sigma$-field of subsets of $V$ such that all
of the functions $f_A(x) = Q(A,x)$ (as functions of $x$) are measurable. The
$\sigma$-field $\Sigma'$ is countably generated.

Proof. The $\sigma$-field $\Sigma'$ is generated by all sets of the form $\{x \in V : Q(A,x) > q\}$,
where $q$ is a rational and $A$ is an element of a countable
field that generates $\mathcal{B}$ (Proposition B.43). □
Next, we show that for each $P \in \mathcal{M}$, $\Sigma$ and $\Sigma'$ differ only by probability
zero sets.

Lemma 2.127. For each $A \in \Sigma$, let $A' = \{x \in V : Q(A,x) = 1\}$. Then
$A' \in \Sigma'$ and $P(A \Delta A') = 0$ for all $P \in \mathcal{M}$.

Proof. Since $Q(A,\cdot)$ is measurable with respect to $\Sigma'$, $A' \in \Sigma'$. Now
$A \Delta A' = (A \setminus A') \cup (A' \setminus A)$. Since $P(Q(A,X) = I_A(X)) = 1$ for all $A \in \Sigma$,
$Q(A,x) = 0$, a.s. for $x \notin A'$, and we get

$$P(A \setminus A') = \int_{A'^C} Q(A,x)\,dP(x) = 0.$$

Since $Q(A,x) = 1$ for all $x \in A'$,

$$P(A' \setminus A) = \int_{A'} [1 - Q(A,x)]\,dP(x) = 0. \qquad \Box$$

Next, we identify the set $E$.

Lemma 2.128. For each $x \in V$, let $S(x)$ be the atom in $\Sigma'$ containing $x$,
that is,

$$S(x) = \{y \in V : Q(A,y) = Q(A,x), \text{ for all } A \in \Sigma'\}.$$

Let $E = \{x \in V : Q(S(x),x) = 1\}$. Then $E \in \Sigma'$ and $P(E) = 1$ for all
$P \in \mathcal{M}$. Also,

$$E = \{x \in V : Q(A,x) = I_A(x), \text{ for all } A \in \Sigma'\}. \qquad (2.129)$$






Proof. First, we prove (2.129). Suppose that $x \in V$ and $Q(A,x) = I_A(x)$
for all $A \in \Sigma'$. Since $S(x) \in \Sigma'$, we have $Q(S(x),x) = I_{S(x)}(x) = 1$, since
$x \in S(x)$. So $x \in E$, and the right-hand side of (2.129) is contained in $E$.
If $x \in E$ and $A \in \Sigma'$, then $S(x) \subseteq A$ if and only if $x \in A$. It follows that
$Q(A,x) = I_A(x)$, and $E$ is a subset of the right-hand side of (2.129).

Next, we prove that $E \in \Sigma'$. Note that $Q(A,x) = I_A(x)$ for all $A \in \Sigma'$
if and only if $Q(A,x) = I_A(x)$ for all $A \in \mathcal{D}$, where $\mathcal{D}$ is a countable field
generating $\Sigma'$. So,

$$E = \bigcap_{A \in \mathcal{D}} \{x \in V : Q(A,x) = I_A(x)\}, \qquad (2.130)$$

which is in $\Sigma'$ because each of the sets in the intersection is in $\Sigma'$.

Finally, we prove $P(E) = 1$ for all $P \in \mathcal{M}$. Since $Q(A,x)$ is a version of
$P(A|\Sigma)$ for all $x \in C$ by Lemma 2.123, we have that for all $A \in \Sigma$,

$$P(Q(A,X) = I_A(X)) = 1.$$

Now use (2.130) again, to conclude $P(E) = 1$. □
Since $E \subseteq C$, we have now established parts 1, 2, and 3 of Theorem 2.111.
Next, we establish part 4.

Lemma 2.131. If $x \in E$ and $A \in \Sigma$, then $Q(A,x) \in \{0,1\}$.

Proof. Let $x \in E$ and $A \in \Sigma$. Then $Q(\cdot,x) \in \mathcal{M}$ by Lemma 2.124.
By Lemma 2.127, there is $A' \in \Sigma'$ such that $Q(A,x) = Q(A',x)$. But
$Q(A',x) = I_{A'}(x)$ by Lemma 2.128. □
Now, we are ready to prove parts 5 and 6 of Theorem 2.111.

Lemma 2.132. Let $\Sigma^*$ be the $\sigma$-field of subsets of $E$ defined by

$$\Sigma^* = \{A \cap E : A \in \Sigma'\}.$$

For each $P \in \mathcal{M}$, there is a unique probability $R$ on $(E,\Sigma^*)$ such that
$P = \int_E Q(\cdot,x)\,dR(x)$ and $R$ is the restriction of $P$ to $\Sigma^*$ as well as the
restriction of $P$ to $\Sigma$.

Proof. Since $Q(A,x)$ is a version of $P(A|\Sigma)$, it follows that $R$ equal to
the restriction of $P$ to $\Sigma^*$ satisfies the representation. To show uniqueness,
let $R$ be a probability on $(E,\Sigma^*)$ which satisfies the representation and let
$A \in \Sigma^*$. Then

$$R(A) = \int_E I_A(x)\,dR(x) = \int_E Q(A,x)\,dR(x) = P(A),$$

where the second equality follows from Lemma 2.128, and the third follows
from the representation. Since $\Sigma^* \subseteq \Sigma$, the restriction of $P$ to $\Sigma$ agrees
with the restriction of $P$ to $\Sigma^*$ on $\Sigma^*$. □
The following result follows easily from Lemma 2.132.






Corollary 2.133. If $P$ and $P'$ are in $\mathcal{M}$ and they agree on $\Sigma^*$, then they
are the same.

Finally, we can prove part 7 of Theorem 2.111. 

Lemma 2.134. A probability $P \in \mathcal{M}$ is extreme if and only if it is a
zero-one measure on $\Sigma$. Also, $P \in \mathcal{M}$ is extreme if and only if

$$P(\{x \in E : Q(\cdot,x) = P\}) = 1. \qquad (2.135)$$

Proof. According to Lemma 2.127, $P \in \mathcal{M}$ is a zero-one measure on
$\Sigma$ if and only if its restriction to $\Sigma^*$ is a zero-one measure. $P$ is a zero-one
measure on $\Sigma^*$ if and only if it is concentrated on one of the atoms,
which are sets of the form $\{x \in E : Q(\cdot,x) = R\}$, for some $R$. But the
representation in Lemma 2.132 implies that $R = P$. So, $P$ is a zero-one
measure on $\Sigma^*$ if and only if (2.135) holds.

Next, we prove that if $P \in \mathcal{E}$, then $P$ is a zero-one measure on $\Sigma^*$.
Suppose, to the contrary, that there is $A \in \Sigma^*$ such that $0 < P(A) = a < 1$.
Let

$$P_1(\cdot) = \frac{1}{a}\int_A Q(\cdot,x)\,dR(x), \qquad P_2(\cdot) = \frac{1}{1-a}\int_{E \setminus A} Q(\cdot,x)\,dR(x),$$

where $R$ is the restriction of $P$ to $\Sigma^*$. Clearly, $P_1(A) = 1$ and $P_2(A) = 0$,
so $P_1 \ne P_2$. But $P = aP_1 + (1-a)P_2$, so $P$ is not extreme.

Finally, we prove that if $P$ is a zero-one measure on $\Sigma^*$, then $P$ is
extreme. Suppose, to the contrary, that $P = aP_1 + (1-a)P_2$ for some
$0 < a < 1$ and $P_1 \ne P_2$ in $\mathcal{M}$. Since $P_i \ll P$ on $\Sigma^*$ for $i = 1,2$, it follows
that $P$, $P_1$, and $P_2$ all concentrate on the same atom in $\Sigma^*$, hence they
agree on $\Sigma^*$. By Corollary 2.133, $P = P_1 = P_2$, a contradiction. □

In particular, Lemma 2.134 says that we can locate the extreme points
by finding all of the $Q(\cdot,x)$ measures for $x \in E$.

Lemma 2.136. A probability $P \in \mathcal{M}$ is in $\mathcal{E}$ if and only if it is a $Q(\cdot,x)$
for some $x \in E$.

Proof. Lemma 2.134 says that $x \in E$ implies $Q(\cdot,x)$ is extreme. (Check
the definition of $E$ in Lemma 2.128.) Conversely, if $P$ is extreme, then $P$ is
a zero-one measure on $\Sigma^*$ and it equals $Q(\cdot,x)$ for $x$ in the atom on which
$P$ concentrates, according to (2.135). □
The proofs of Theorems 2.113 and 2.114 require a lemma first.

Lemma 2.137. Let $(\mathcal{Y},\mathcal{D})$ be a Borel space, and assume the conditions of
Theorem 2.111 hold with $\mathcal{X}_n = \mathcal{Y}^n$. Let $X_n$ be $(Y_1,\ldots,Y_n)$. Then all IID
distributions in $\mathcal{M}^*$ are in $\mathcal{E}^*$.

Proof. Let $\mathcal{A}_n$ be the $\sigma$-field generated by all functions from $\mathcal{X}_n$ to $\mathcal{T}_n$
which are symmetric with respect to permutations of the coordinates. Let
$\mathcal{A}_\infty = \bigcap_{n=1}^\infty \mathcal{A}_n$. Then $\Sigma^* \subseteq \mathcal{A}_\infty$, where $\Sigma^*$ is the image of $\Sigma$ under the
bimeasurable mapping w. We will prove that the IID distributions are zero-one
on $\mathcal{A}_\infty$. We do this by proving that IID distributions are conditional
distributions given $\mathcal{A}_\infty$. Let $P$ stand for the distribution of the data which
says that the $Y_i$ are IID with distribution equal to the limit of the empirical
CDFs. Since the empirical CDF based on $Y_1,\ldots,Y_n$ is $\mathcal{A}_n$ measurable, and
since $\mathcal{A}_n \subseteq \mathcal{A}_{n+1}$ for all $n$, it follows that $P$ is $\mathcal{A}_n$ measurable for every $n$,
hence it is $\mathcal{A}_\infty$ measurable. To see that $P(B) = \Pr((Y_{i_1},\ldots,Y_{i_k}) \in B|\mathcal{A}_\infty)$,
we need to prove that, for every $C \in \mathcal{A}_\infty$,



The proof of this is virtually identical to the proof of (1.84) on page 47
and will not be repeated here. This means that the IID distributions are
conditional distributions given $\mathcal{A}_\infty$, hence they are 0-1 on $\Sigma^*$ and in the
extremal family. □

Proof of Theorem 2.113. To see that each distribution in $\mathcal{M}^*$ is the
distribution of exchangeable random quantities, let $i_1,\ldots,i_k$ be distinct,
let $n = \max\{i_1,\ldots,i_k\}$, let $B \in \mathcal{B}^k$, and let $\eta \in \mathcal{M}^*$. Let

$$A = \{(x_1,\ldots,x_n) : (x_{i_1},\ldots,x_{i_k}) \in B\},$$
$$A' = \{(x_1,\ldots,x_n) : (x_1,\ldots,x_k) \in B\}.$$

Since $r_n(\cdot,t)$ is the distribution of exchangeable random quantities for all $n$
and $t$, we have $r_n(A,t) = r_n(A',t)$ for all $n$ and $t$. Let $\eta_{T_n}$ be the distribution
of $T_n$. Then

$$\eta((Y_{i_1},\ldots,Y_{i_k}) \in B) = \int r_n(A,t)\,d\eta_{T_n}(t) = \int r_n(A',t)\,d\eta_{T_n}(t) = \eta((Y_1,\ldots,Y_k) \in B).$$

Next, we want to prove that the IID distributions are the extremal
distributions. Lemma 2.137 says that the IID distributions in $\mathcal{M}^*$ are contained
in $\mathcal{E}^*$. We now prove that all extremal distributions are IID. It follows from
DeFinetti's representation theorem 1.49 that every distribution $\eta$ in $\mathcal{M}^*$
is a mixture of IID distributions. We show next that if $\eta \in \mathcal{E}^*$, then the
mixture must be trivial. Let $\eta \in \mathcal{E}^*$, and represent



(2.138)

as in DeFinetti's representation theorem 1.49, where $P^n$ is the distribution
that says that $X_n = (Y_1,\ldots,Y_n)$ are IID with distribution $P$. Let $q(P)$ be
the joint distribution on $(\mathcal{Y}^\infty,\mathcal{D}^\infty)$ which says that $\{Y_n\}_{n=1}^\infty$ is IID with
distribution $P$. Since condition T says that $X_n$ is conditionally independent
of $\{Y_{n+i}\}_{i=1}^\infty$ given $T_n$, and since $P$ is a function of $\{Y_{n+i}\}_{i=1}^\infty$, it follows that
$X_n$ is conditionally independent of $P$ given $T_n$. This means that the conditional
distribution of $X_n$ given $T_n = t$ and $P$ is $r_n(\cdot,t)$. Hence, $q(P) \in \mathcal{M}^*$
with probability 1. This makes (2.138) a representation of $\eta$ as a mixture
of elements of $\mathcal{M}^*$. Since $\eta \in \mathcal{E}^*$, the mixture must be trivial, that is,
$q(P) = \eta$ with probability one. So, we have that all distributions in $\mathcal{E}^*$ are
IID distributions.

The last claim in the theorem follows from the fact that each element of
$\mathcal{E}$ is the limit of $r_n$ probabilities. □
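
The remark that extreme points arise as limits of the $r_n$ probabilities can be made concrete numerically. The sketch below assumes a special case that is not taken from the proof itself: binary coordinates, $T_n$ equal to the sum, and $r_n(\cdot,t)$ uniform over all arrangements of $t$ ones among $n$ places (draws without replacement from an urn). The function names `urn_prefix_prob` and `iid_prob` are hypothetical. With $t/n$ held near $p$, the probability of a fixed pattern in the first $k$ coordinates approaches the IID Bernoulli($p$) value.

```python
from math import comb


def urn_prefix_prob(pattern, n, t):
    """P(first k draws equal `pattern`) when drawing without replacement
    from an urn holding t ones and n - t zeros (assumed form of r_n)."""
    k, j = len(pattern), sum(pattern)
    if j > t or k - j > n - t:
        return 0.0  # the pattern needs more ones or zeros than the urn holds
    # Count arrangements of the remaining n - k places with t - j ones.
    return comb(n - k, t - j) / comb(n, t)


def iid_prob(pattern, p):
    """Probability of `pattern` under IID Bernoulli(p) coordinates."""
    j = sum(pattern)
    return p ** j * (1 - p) ** (len(pattern) - j)


pattern = (1, 0, 1)
p = 0.3
for n in (10, 100, 1000, 10000):
    t = round(p * n)  # keep t/n close to p as n grows
    print(n, urn_prefix_prob(pattern, n, t), iid_prob(pattern, p))
```

The printed urn probabilities settle toward the IID value as $n$ grows, which is the finite-dimensional convergence that Lemma 2.121 formalizes in general.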
Proof of Theorem 2.114. Since $r_n(\mathbb{R}^k,t) = 1$ for all $t$, it is clear that
$r_n$ is a transition kernel. Since $B(t) = B \cap T_n^{-1}(\{t\})$, $r_n$ satisfies condition
2.112. Since every member of the exponential family in question has
the conditional distribution of $(Y_1,\ldots,Y_n)$ given $T_n$ equal to $r_n$, it follows
that condition S holds and that every member of the exponential family
is also in $\mathcal{M}$. Since $Y_{n+1} = T_{n+1} - T_n$, Proposition B.28 says that the
conditional distribution of $X_n$ given $(T_n,Y_{n+1}) = (t,y)$ is the same as that
given $(T_n,T_{n+1}) = (t,t+y)$, so condition T holds. So the conditions of
Theorem 2.113 hold. Lemma 2.137 says that every member of the exponential
family is in the extremal family. We now prove that all extremal
distributions are in the exponential family. We may assume, without loss
of generality, that $h(0) > 0$. (If not, find $c$ such that $h(c) > 0$ and subtract
$c$ from all $X_i$. Then replace $h(y)$ by $v(y) = h(y+c)$ and note that $r_n$ has
the same form in terms of $v$ as it does in terms of $h$.) Let $f$ be the density
of a distribution in the extremal family, and let

$$f^{(2)}(t) = \int f(t-y)f(y)\,dy.$$

Then $f^{(2)}$ is the density of $Y_1 + Y_2$, since $Y_1$ and $Y_2$ are IID in the extremal
family. Since $f$ leads to the same conditional distributions as $h$, we get



hence $f(0) > 0$, since $h(0) > 0$. It also follows that

$$\frac{f(t-y)f(y)}{f^{(2)}(t)} = \frac{h(t-y)h(y)}{h^{(2)}(t)}, \quad \text{a.e.}, \qquad (2.139)$$

since both sides give the conditional distribution of $Y_1$ given $Y_1 + Y_2 = t$.
Define



By taking the log of both sides of (2.139), we get $\lambda(t-y) + \lambda(y) = \phi(t)$.
Now, set $y = t$ and note that $\lambda(0) = 0$, so that $\lambda(t) = \phi(t)$. It follows that
$\lambda(t-y) + \lambda(y) = \lambda(t)$. According to Theorem C.9, $\lambda(y) = a + b^\top y$ for some
scalar $a$ and vector $b$. Hence, $f(y)$ is a constant times $h(y)\exp(b^\top y)$ for all
$y$ such that $f(y) > 0$. To see that $f(y) > 0$ whenever $h(y) > 0$, note that
(2.139) implies that $f(y) = 0$ and $h(y) > 0$ means that $h(t-y)/h^{(2)}(t) = 0$
for all $t$. But this would contradict



2.5 Problems 

Section 2.1.2: 

1. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $N(\theta,1)$, for $\theta \in \mathbb{R}$. Let
$X = (X_1,\ldots,X_n)$ and find a one-dimensional sufficient statistic $T$. Also,
find the conditional distribution of $X$ given $T = t$.

2. Refer to the definition of $P_{\theta,T}$ on page 84. If $\{P_\theta : \theta \in \Omega\}$ is a regular
conditional distribution on $(\mathcal{X},\mathcal{B})$ given $\Theta$, then prove that $\{P_{\theta,T} : \theta \in \Omega\}$
is a regular conditional distribution on $(\mathcal{T},\mathcal{C})$ given $\Theta$.

3. Suppose that $X_1,\ldots,X_n$ are conditionally IID with $N(\mu,\sigma^2)$ distribution
given $\Theta = (\mu,\sigma)$. Find a two-dimensional sufficient statistic.

4. Let $X_1,\ldots,X_n$ be conditionally independent given $P = p$ with $X_i$ having
conditional density (with respect to counting measure on the nonnegative
integers)

where $\alpha_1,\ldots,\alpha_n$ are known strictly positive numbers. (These are generalized
negative binomial random variables.) Define $T = \sum_{i=1}^n X_i$. Find the
conditional distribution of $(X_1,\ldots,X_n)$ given $T = t$ and $P = p$.

5. Let $P_\theta$ say that $X_1,\ldots,X_n$ are IID $Poi(\theta)$. Show that $T = \sum_{i=1}^n X_i$ is
sufficient by both Definitions 2.4 and 2.8.

6. Prove Proposition 2.12 on page 86 and find the conditional distribution of 
(Xi, . . . , X n ) given the order statistics. 

7. Prove Proposition 2.23 on page 90. 

8. For the experiment in Example 2.53 on page 102, find the conditional
distribution of $X$ given $X_1$ and $\Theta$ and show that $X_1$ is not sufficient. Show
that $X$ is minimal sufficient.

9. *Consider the experiment described in Example 2.54 on page 102. Let $N$ be

the number of observed Yi, X = iC^O, l} m , and X = (Z, Yi , . . . , Y N ). 

(a) Find the density of $X$ given $\Theta = \theta$ with respect to counting measure
on $(\mathcal{X},2^{\mathcal{X}})$.



2.5. Problems 139 



(b) Let $M$ be the number of observed successes among the $Y_i$, that is,
$M = \sum_{i=1}^N Y_i$. Show that $(N,M)$ is sufficient. In particular, we do
not need to keep track of $Z$.

(c) Find the conditional distribution of $Z$ given $(N,M,\Theta)$, and show that
it does not depend on $\theta$.

10. (Nonexchangeable example) Let $P_\theta$ say that $\{X_n\}_{n=1}^\infty$ are Bernoulli random
variables with the following joint distribution:

$$P_\theta(X_1 = 1) = \theta_1,$$
$$P_\theta(X_i = 1|X_1,\ldots,X_{i-1}) = \begin{cases} \theta_{11} & \text{if } X_{i-1} = 1, \\ \theta_{10} & \text{if } X_{i-1} = 0, \end{cases}$$

where $\theta = (\theta_1,\theta_{11},\theta_{10})$.

(a) Let $X = (X_1,\ldots,X_n)$. Find a four-dimensional sufficient statistic.

(b) Suppose that $X = (X_1,\ldots,X_N)$, where $N$ is the number of observations
until $k$ successes (1s) have been observed, where $k$ is known.
Find a three-dimensional sufficient statistic.



Section 2.1.3: 



11. Prove that the sufficient statistic T found in Problem 1 on page 138 is 
minimal sufficient. 

12. Show that T is a complete sufficient statistic in Problem 4 on page 138. 

13. (Logistic regression) Let $\{Y_i\}_{i=1}^n$ be Bernoulli random variables, but assume
that each $Y_i$ comes with a known vector $x_i$ of $k$ covariates. Conditional on
$\Theta = \theta$ (a vector of length $k$), the $Y_i$ are independent with

Let X = (Yi, . . . , Y n ). Find a minimal sufficient statistic (vector). 

14. *Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be conditionally IID with uniform distribution
on the disk of radius $r$ centered at $(\theta_1,\theta_2)$ in $\mathbb{R}^2$ given
$(\Theta_1,\Theta_2,R) = (\theta_1,\theta_2,r)$.

(a) If $(\Theta_1,\Theta_2)$ is known, find a minimal sufficient statistic for $R$.

(b) If all parameters are unknown, show that the convex hull of the sample 
points is a sufficient statistic. 

15. *Here, we will construct a function T as needed in Theorem 2.29 for the 

general case. The function will turn out to be essentially the likelihood 
function. 

(a) Let T be the space of functions $f : \Omega \to [0,\infty)$ with the product
$\sigma$-field $\mathcal{C}$. Prove that the function $T_1 : \mathcal{X} \to$ T defined by
$T_1(x) = f_{X|\Theta}(x|\cdot)$ is measurable.






(b) Consider the relation ~ on T defined by $f \sim g$ if there exists $c > 0$
such that $g(\theta) = cf(\theta)$ for all $\theta$. Prove that ~ is an equivalence
relation. (That is, prove that (i) $f \sim f$, (ii) $f \sim g$ implies $g \sim f$, and
(iii) ($f \sim g$ and $g \sim h$) implies $f \sim h$.)

(c) Let T be the set of all equivalence classes $[f] = \{g : g \sim f\}$. Let the
$\sigma$-field of subsets of T be the smallest $\sigma$-field containing sets of the
form $A_{\theta,\psi,c} = \{[f] : f(\theta) < cf(\psi)\}$. Prove that $[f] \in A_{\theta,\psi,c}$ if and
only if $[g] \in A_{\theta,\psi,c}$ for all $g \in [f]$.

(d) Let $T_2$, mapping the function space of part (a) to the set T of equivalence
classes, be defined by $T_2(f) = [f]$. Prove that $T_2$ is measurable.

(e) Prove that T = T 2 (Ti) satisfies T{x) = T(y) if and only if y 6 
in the notation of Theorem 2.29. 

16. Suppose that $\{(X_n,Y_n)\}_{n=1}^\infty$ are conditionally IID given $\Theta = \theta$ with
distribution uniform on the disk of radius $\theta$ centered at $(0,0)$, that is,

$$f_{X_1,Y_1|\Theta}(x_1,y_1|\theta) = \frac{1}{\pi\theta^2} I_{[0,\theta]}\left(\sqrt{x_1^2 + y_1^2}\right).$$

Let $X = [(X_1,Y_1),\ldots,(X_n,Y_n)]$. Find a complete sufficient statistic and
its distribution.

17. Suppose that $\Omega = \{(\theta_1,\theta_2) : \theta_2 > \theta_1\}$ and $P_{\theta_1,\theta_2}$ says that $X_1,\ldots,X_n$ are
IID with $U(\theta_1,\theta_2)$ distribution. Find minimal sufficient statistics.

18. *Let $X$ be a discrete random variable, and let

$$f_{X|\Theta}(x|\theta) = \begin{cases} \theta & \text{if } x = 0, \\ (1-\theta)^2\theta^{x-1} & \text{if } x = 1,2,\ldots, \\ 0 & \text{otherwise} \end{cases}$$

be the density of $X$ (conditional on $\Theta = \theta$) with respect to counting measure
on the integers. Let $\Omega = (0,1)$. Prove that $X$ is boundedly complete,
but not complete.

19. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $Ber(\theta)$. Let $X = (X_1,\ldots,X_n)$
and $T = \sum_{i=1}^n X_i$. Prove that $T$ is a complete sufficient statistic without
using Theorem 2.74.

20. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $U(0,\theta)$. Let $X = (X_1,\ldots,X_n)$
and $T = \max_{i=1,\ldots,n} X_i$. Prove that $T$ is a complete sufficient statistic.

21. *Suppose that $X_1,\ldots,X_n$ are IID given $\Theta = \theta$ with conditional distribution
uniform on the set $[0,\theta] \cup [2\theta,3\theta]$. That is,

$$f_{X_1|\Theta}(x|\theta) = \frac{1}{2\theta}\left(I_{[0,\theta]}(x) + I_{[2\theta,3\theta]}(x)\right).$$

Find a minimal sufficient statistic (dimension at most 3).

22. Let $Z = (X_1,\ldots,X_n,Y_1,\ldots,Y_n)$ where the $X_i$ and $Y_i$ are all conditionally
independent with $X_i \sim N(\mu,\sigma_X^2)$ and $Y_i \sim N(\mu,\sigma_Y^2)$ given
$\Theta = (\mu,\sigma_X,\sigma_Y)$. Let $T(Z) = (\bar{X},\bar{Y},S_X^2,S_Y^2)$, the usual sample means and
variances. Show that $T$ is minimal sufficient but that $T$ is not boundedly
complete.






23. *Let $\Omega = \{1,2,3\}$, and let $X = (X_1,\ldots,X_n)$ be conditionally IID given
$\Theta = \theta$, with density $f_\theta(\cdot)$. Suppose that $P_1$, $P_2$, and $P_3$ have density functions
(with respect to Lebesgue measure): $f_1(x) = I_{(-1,0)}(x)$, $f_2(x) = I_{(0,1)}(x)$,
and $f_3(x) = 2xI_{(0,1)}(x)$. Thus, the model has only three members. Let
$S(x) = \{(j,k) : \prod_{i=1}^n f_j(x_i) + \prod_{i=1}^n f_k(x_i) > 0\}$. Let

Show that $T$ is minimal sufficient.

24. (Nonexchangeable example) Suppose that $\{X_n\}_{n=1}^\infty$ is a sequence of random
variables, that $\Theta \in (-1,1)$ is a parameter such that the conditional
distribution of $X_1$ given $\Theta = \theta$ is $N(0,[1-\theta^2])$, and that, for $i > 1$, the conditional
distribution of $X_i$ given $\Theta = \theta$ and $(X_1,\ldots,X_{i-1}) = (x_1,\ldots,x_{i-1})$
is $N(\theta x_{i-1},1)$.²²

(a) If $X = (X_1,\ldots,X_n)$, find a three-dimensional minimal sufficient
statistic.

(b) Find the conditional distribution of $X_{n+1}$ given $X = x$, and show
that it depends on more of the data $X$ than the minimal sufficient
statistic.

Section 2.1.4: 

25. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $U(\theta - 1/2,\theta + 1/2)$. Let
$X = (X_1,\ldots,X_n)$. Find minimal sufficient statistics and find a nonconstant
function of the sufficient statistic which is ancillary.

26. Prove that if $S$ is ancillary, then $S$ and $\Theta$ are independent no matter what
prior one uses for $\Theta$.

27. (A vector example) Suppose that $P_\theta$ says

$$\binom{X_1}{Y_1},\ldots,\binom{X_n}{Y_n} \sim \text{IID } N_2\left(\binom{0}{0}, \begin{pmatrix} 1 & \theta \\ \theta & 1 \end{pmatrix}\right)$$

with $\Omega = (-1,1)$.

(a) Find a two-dimensional minimal sufficient statistic.

(b) Prove that the minimal sufficient statistic found above is not complete.

(c) Prove that $Z_1 = \sum_{i=1}^n X_i^2$ and $Z_2 = \sum_{i=1}^n Y_i^2$ are both ancillary but
that $(Z_1,Z_2)$ is not ancillary.

28. *Consider the situation in Example 2.51 on page 100. Prove that the
distribution of $U$ is uniform on the sphere centered at the vector $0$ of $n$ 0s in
the hyperplane defined by $1^\top u = 0$, where $1$ is the vector of $n$ 1s. (Hint:
Let $A$ be an orthogonal $n \times n$ matrix that maps the hyperplane to itself.
Prove that $AU$ has the same distribution as $U$.)

²² Such a sequence is often called an autoregression of order 1.




29. Suppose that $\Pr(X > 0) = 1$ and that the conditional distribution of $Y$
given $X = x$ is $U(0,x)$. Let $Z = X - Y$ and suppose that $Y$ and $Z$ are
independent. Let $f_X(x)$ and $f_Z(z)$ be differentiable.

(a) Prove that $\Pr(X > c) > 0$ for all $c > 0$.

(b) Prove $f_X(x) = a^2 x\exp(-ax)$, for $x > 0$.

30. Let $X_1,\ldots,X_n$ be conditionally IID given $\Theta = \theta$ each with density $g(x - \theta)$
for some function $g$. Prove that $\max\{X_1,\ldots,X_n\} - \min\{X_1,\ldots,X_n\}$ is
ancillary, but not maximal ancillary if $n > 2$.

31. Consider the situation in Example 2.46 on page 97. Suppose that we wish
to condition on $N_0$ if



and we wish to condition on $M_0$ otherwise. For which data sets would we
choose $N_0$, and for which would we choose $M_0$?

32. Consider the situation in Example 2.46 on page 97. Suppose that we need
to choose upon which ancillary to condition before we see the data. Suppose
that we decide to condition on $N_0$ if



and we will condition on $M_0$ otherwise. On which ancillary will we condition?

33. Call a statistic U ignorable if there exists a sufficient statistic T such that 
T and U are conditionally independent given 0. Prove that an ignorable 
statistic is ancillary. 

Section 2.2: 

34. Express the family of Poisson distributions in exponential family form. Find 
the natural parameter, natural parameter space, and sufficient statistic. Use 
Theorem 2.64 to find the mean and the variance of the sufficient statistic. 

35. Express the family of Beta distributions in exponential family form. Find
the natural parameter, natural parameter space, and sufficient statistic(s).
Use Theorem 2.64 to find the mean and the variance of the sufficient statistics.
(Hint: The derivative of the log of the gamma function is called the
digamma function $\psi$. The second derivative is called the trigamma function $\psi'$.)



36. In Problem 9 on page 138, show that the family of distributions for ob- 
served data is an exponential family but that the sufficient statistic is not 
complete. How do you reconcile this with Theorem 2.74? 

37. Prove Proposition 2.70 on page 107. 

38. Prove Proposition 2.72 on page 107. 








Section 2.3: 

39. Suppose that $X$ and $Y$ are conditionally independent given $\Theta$ with
conditional densities $f_{X|\Theta}(x|\theta)$ and $f_{Y|\Theta}(y|\theta)$, respectively. Suppose that $\Theta$ is
$k$-dimensional. Prove that $\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta)$.

40. Prove Proposition 2.84 on page 112. 

41. Prove Proposition 2.92 on page 116. 

42. Suppose that $P_\theta$ says that $X \sim Poi(\theta)$.

(a) Find the Fisher information $\mathcal{I}_X(\theta)$.

(b) Find Jeffreys' prior.

43. Suppose that the FI regularity conditions hold and that two derivatives
can be passed under integral signs. Let $T$ be an ancillary statistic. Prove
that $\mathcal{I}_{X|T}(\theta|t)$ has $(i,j)$ entry equal to $-E_\theta(\partial^2 \log f_{X|T,\Theta}(X|t,\theta)/\partial\theta_i\partial\theta_j)$.

44. Let $\Omega = \{(p_1,p_2,p_3) : p_i > 0, \sum_{i=1}^3 p_i = 1\}$ and

$$f_{X|\Theta}(x|p_1,p_2,p_3) = \begin{cases} p_1 & \text{if } x = 1, \\ p_2 & \text{if } x = 2, \\ p_3 & \text{if } x = 3, \\ 0 & \text{otherwise.} \end{cases}$$

Let $\theta_0 = (1/3,1/3,1/3)$, and find the value of $\theta$ such that $E_\theta(X) = 2.5$ and
$\mathcal{I}_X(\theta_0;\theta)$ is minimized.

45. Suppose that $X \sim U(0,\theta)$ given $\Theta = \theta$. Find the Kullback-Leibler information
$\mathcal{I}_X(\theta_1;\theta_2)$ for all pairs $(\theta_1,\theta_2)$.

46. Suppose that person 1 believes $\Pr(\Theta = 1/3) = \pi_0$ and $\Pr(\Theta = 1/2) = 1 - \pi_0$
and person 2 believes $\Pr(\Theta = q) = 1$. Both persons believe that $\{X_n\}_{n=1}^\infty$
are IID $Ber(\theta)$ given $\Theta = \theta$. Let $Y_n = \sum_{i=1}^n X_i/n$.

(a) Find $\Pr(\Theta = 1/3|Y_n = q)$ for person 1.

(b) For each possible value of $q$, describe person 2's beliefs about how
the value of $\Pr(\Theta = 1/3|Y_n)$ (calculated by person 1) will behave as
$n \to \infty$.

47. Let $Y$ be the number of patients (out of $n$) who survive for one year after
an operation. Let $Z$ be the number of patients who survive for five years.
Let $\Theta = (P,Q)$, and suppose that we model $Y \sim Bin(n,p)$ given $\Theta = (p,q)$
and $Z \sim Bin(y,q)$ given $Y = y$ and $\Theta = (p,q)$. Let $X = (Y,Z)$. Find $\mathcal{I}_X(\theta)$
and Jeffreys' prior.

Section 2.4: 

48. *Let $T_n(x_1,\ldots,x_n) = \sum_{i=1}^n x_i$, and suppose that $(X_1,\ldots,X_n)$ given $T_n = t$
is distributed uniformly on the portion of the hyperplane $\sum_{i=1}^n x_i = t$ with
all coordinates nonnegative. Find the extremal family of distributions.

49. *Let $\mathcal{X}_1 = \{0,1\}$. Let $T_n(x_1,\ldots,x_n) = \sum_{i=1}^n x_i$, and suppose that the
distribution of $X_1,\ldots,X_n$ given $T_n = t$ is that of draws without replacement
from an urn containing $t$ 1s and $n - t$ 0s. Find the extremal family of
distributions.



Chapter 3 
Decision Theory 



A major use of statistical inference is its application to decision making 
under uncertainty. When the costs and/or benefits of our actions depend 
on quantities we will not know until after we make our decisions, we need 
to be able to weigh the costs against the uncertainties intelligently. 

3.1 Decision Problems 
3.1.1 Framework 

Suppose that one can determine ahead of time a set of actions from which
one will have to choose. We name this set $\aleph$ and call it the action space. This
set will contain all of the actions under consideration. We will occasionally
need to introduce a measure over this set, so let $\alpha$ be a $\sigma$-field of subsets
of $\aleph$. In the most general type of decision problem we will consider, we
suppose that there is a not yet observed quantity $V$ (taking values in a set
$\mathcal{V}$) on which depends the amount we lose as a function of our action.

Example 3.1. Suppose that we are trying to decide whether to keep a store 
open for an extra hour during a busy shopping season. We might be able to 
determine the extra costs of overhead and payroll associated with staying open, 
but the amount of additional net sales V is as yet unknown. The final profit or 
loss associated with the decision depends on V . 

Definition 3.2. Let $(S, \mathcal{A}, \mu)$ be a probability space, $\aleph$ an action space,
and $V : S \to \mathcal{V}$ a function. A loss function is a function $L : \mathcal{V} \times \aleph \to \mathbb{R}$.
$L(v, a)$ measures "how much" we lose by choosing action $a$ when $V = v$.

Consider the following simple example. 






Example 3.3. Let $V \sim N(1, 1)$, and suppose that $\aleph = \mathbb{R}$ and $L(v, a) = (v - a)^2$.
In this case, the amount I lose when I choose action $a$ is the squared distance
between $a$ and the unknown $V$. Alternatively, we might have $L(v, a) = 3|v - a|$.
In this case, I lose three times the distance between $V$ and $a$.

The conditional distribution of $V$ given $\Theta = \theta$ will be denoted by $P_{\theta,V}$.
For convenience we will assume that $V$ and $X$ are conditionally independent
given $\Theta$. Most often, in discussions of statistical theory, the function $V$ is
$\Theta$, so that $P_{\theta,V}(B) = I_B(\theta)$, and $L : \Omega \times \aleph \to \mathbb{R}$. But this is not actually
necessary. The goal of decision theory is to make $L(V, a)$ as small as possible
by choice of $a \in \aleph$. Unfortunately, this would normally require that we know
the value of $V$. For example, in Example 3.3, both losses are smallest if $a = V$.
If $V$ is a function that depends on the future (coordinates later than those
observed or the parameter), then we will not know $V$ at the time a decision
will need to be made. When $V = \Theta$, we will not even know $V$ after the
decision is made.

The tools we use to make decisions are called decision rules. 

Definition 3.4. A randomized decision rule $\delta$ is a mapping from $\mathcal{X}$ to
probability measures on $(\aleph, \alpha)$ such that for every $A \in \alpha$, $\delta(\cdot)(A)$ is
measurable. A nonrandomized decision rule $\delta$ is a randomized decision rule that
for each $x$ assigns probability 1 to a single action, denoted by $\delta(x)$. That
is, a randomized decision rule $\delta$ is a nonrandomized rule if, for each $x \in \mathcal{X}$,
there exists $a_x \in \aleph$ such that $\delta(x)(A) = I_A(a_x)$, and in such a case $a_x$ is
denoted by $\delta(x)$.

Note that the definition of a randomized decision rule makes it a regular
conditional distribution over $(\aleph, \alpha)$ given $X$. Of course, if one were actually
to use a randomized decision rule, one would need to choose an action in
$\aleph$, not just a probability measure over $(\aleph, \alpha)$. To do this, one takes the
observed $x$ and simulates (see Section B.7) a pseudorandom element of $\aleph$
according to the probability measure $\delta(x)(\cdot)$. Hence, an alternative method
for specifying a randomized rule is to specify, for each possible $x$, the way
in which one will simulate the action from $\aleph$.

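Example 3.5. Suppose that $m$ and $n$ are even integers. Suppose that $P_\theta$ says
that $X_1, \ldots, X_{n+m}$ are IID $Ber(\theta)$ random variables. Let $X = (X_1, \ldots, X_n)$ and
$V = \sum_{i=1}^m X_{n+i}$. Let the action space be $\aleph = \{a_0, a_1\}$, and suppose that the loss
function is

$$L(v, a) = \begin{cases} 0 & \text{if } (v < m/2 \text{ and } a = a_0) \text{ or if } (v > m/2 \text{ and } a = a_1), \\ 1/2 & \text{if } v = m/2, \\ 1 & \text{otherwise.} \end{cases}$$

Let $Y = \sum_{i=1}^n X_i$ and $y = \sum_{i=1}^n x_i$. Here is a plausible randomized decision rule:

$$\delta(x) = \begin{cases} \text{probability 1 on } a_0 & \text{if } y < n/2, \\ \text{probability 1 on } a_1 & \text{if } y > n/2, \\ \text{probability 1/2 on each} & \text{if } y = n/2. \end{cases}$$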






If Y = n/2 is observed, one could flip a fair coin to decide between the two 
actions. 

If $\delta$ is a randomized rule, set

$$L(v, \delta(x)) = \int_\aleph L(v, a)\, d\delta(x)(a).$$

This then allows us to talk about the loss incurred by either a randomized
or a nonrandomized rule without regard to the result of the auxiliary
randomization in the randomized rule.

Example 3.6 (Continuation of Example 3.5; see page 145). If $y = n/2$, then
one can easily show that $L(v, \delta(x)) = 1/2$ for all $v$.
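The claim in Example 3.6 can be checked mechanically. The following is a minimal sketch, not part of the original development: actions are coded as 0 (for $a_0$) and 1 (for $a_1$), a randomized rule is represented as a dictionary from actions to probabilities, and the function names are hypothetical.

```python
import random

def loss(v, a, m):
    """Loss of Example 3.5; a is 0 (action a0) or 1 (action a1)."""
    if v == m / 2:
        return 0.5
    correct = (v < m / 2 and a == 0) or (v > m / 2 and a == 1)
    return 0.0 if correct else 1.0

def rule(y, n):
    """Randomized rule of Example 3.5 as a dict {action: probability}."""
    if y < n / 2:
        return {0: 1.0}
    if y > n / 2:
        return {1: 1.0}
    return {0: 0.5, 1: 0.5}

def averaged_loss(v, y, n, m):
    """L(v, delta(x)): the loss integrated over the rule's distribution on actions."""
    return sum(p * loss(v, a, m) for a, p in rule(y, n).items())

def draw_action(y, n, rng=random):
    """Simulate one action from delta(x), as one would do when using the rule."""
    dist = rule(y, n)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions])[0]

n, m = 10, 6  # illustrative even integers
print([averaged_loss(v, n // 2, n, m) for v in range(m + 1)])
```

With $y = n/2$ the averaged loss does not depend on $v$, which is the content of Example 3.6; `draw_action` corresponds to the coin flip described after Example 3.5.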

3.1.2 Elements of Bayesian Decision Theory 

In the Bayesian paradigm, one calculates the posterior risk

$$r(\delta \mid x) = \int_{\mathcal{V}} L(v, \delta(x))\, d\mu_{V|X}(v \mid x)$$

for each decision rule and chooses the one with the smallest posterior risk.
Here, $\mu_{V|X}$ denotes the conditional distribution of $V$ given $X$. If we do
this for every $x$ and the posterior risk is never $+\infty$, the resulting rule is
called a formal Bayes rule.

Definition 3.7. If $\delta_0$ is such that $r(\delta_0 \mid x) < \infty$ for all $x$ and $r(\delta_0 \mid x) \le r(\delta \mid x)$
for all $x$ and all decision rules $\delta$, then $\delta_0$ is called a formal Bayes rule.

The use of formal Bayes rules is based on the following principle. 

The Expected Loss Principle: When one compares two rules 
after observing data, the better rule is the one with the smaller 
posterior risk. 

A justification for this principle will be given in Section 3.3. One feature 
of that justification, which we do not use here, however, is that the loss 
function needs to be bounded. 

Example 3.8. Let $\mathcal{V} \subseteq \mathbb{R}$, $\aleph \subseteq \mathbb{R}$, and $L(v, a) = (v - a)^2$. Then

$$r(\delta \mid x) = \int_{\mathcal{V}} (v - \delta(x))^2\, d\mu_{V|X}(v \mid x) = E(V^2 \mid X = x) - 2\delta(x) E(V \mid X = x) + \delta(x)^2.$$

Assuming that $E(V^2 \mid X = x) < \infty$, we can easily minimize the posterior risk by
setting $\delta(x) = E(V \mid X = x)$. This result is very general. So long as the posterior
variance of $V$ is finite, a formal Bayes rule with squared-error loss is the posterior
mean of $V$.
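The expansion in Example 3.8 can be verified exactly on a small case. The sketch below assumes a hypothetical three-point posterior for $V$ (the support points and probabilities are invented for illustration) and uses exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical posterior distribution of V given X = x (illustration only).
posterior = {F(0): F(1, 4), F(1): F(1, 2), F(4): F(1, 4)}

def posterior_risk(a):
    """Posterior expected squared-error loss of the action a."""
    return sum(p * (v - a) ** 2 for v, p in posterior.items())

mean = sum(p * v for v, p in posterior.items())
second_moment = sum(p * v * v for v, p in posterior.items())

def expanded(a):
    """The expansion E(V^2 | x) - 2 a E(V | x) + a^2 from Example 3.8."""
    return second_moment - 2 * a * mean + a * a

print(mean, posterior_risk(mean))
```

Here the minimum posterior risk equals the posterior variance, and any action other than the posterior mean gives a strictly larger value.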






It is possible that there exist $x$ values such that the posterior risk given
$X = x$ is $+\infty$ for every decision rule. Also, it is possible that although
there exist rules with posterior risk $< \infty$ given $X = x$, there is no rule
that achieves the minimum of the posterior risk. In these cases, there is no
formal Bayes rule as we have defined it, although there may exist $x$ values
such that, conditional on $X = x$, the posterior risk can be minimized at a
value $< \infty$. In this latter case, we call a rule that minimizes the posterior
risk at all values of $x$ for which a minimum $< \infty$ can be achieved a partial
Bayes rule.

Example 3.9. Suppose that $\{Y_n\}_{n=1}^\infty$ are conditionally IID with Cauchy
distribution $Cau(\theta, 1)$ given $\Theta = \theta$, where $\Omega = \mathbb{R} = \aleph$, $V = \Theta$, and $L(\theta, a) = (a - \theta)^2$.
Let the prior distribution of $\Theta$ be $Cau(0, 1)$. Let $t > 0$ and let $X_i = \min\{t, Y_i\}$.
Define $X = (X_1, X_2, X_3)$. If at least one of the $X_i$ is strictly less than $t$, then
the posterior risk will be finite for some decision rule. But if all three $X_i = t$,
the posterior risk is infinite for all decision rules. In this example, a partial Bayes
rule is any rule that chooses the action minimizing the posterior risk for those
data in which at least one $X_i < t$. As we saw in Example 3.8, the action to
choose in those cases is the posterior mean of $\Theta$. For those data $x$ such that the
posterior risk is infinite (all $X_i = t$), it might still make sense to choose $\delta(x)$ in
such a way that the posterior distribution of $L(V, \delta(x))$ is stochastically small.
That is, if we define $Z_\delta = L(V, \delta(x))$, we should prefer $\delta_1$ to $\delta_2$ if the CDF of $Z_{\delta_1}$
is everywhere larger than the CDF of $Z_{\delta_2}$.

Example 3.10. Let $\aleph = (0, 1) = \Omega$, and let $X = (X_1, \ldots, X_{10})$ where the $X_i$ are
IID with $Ber(\theta)$ distribution given $\Theta = \theta$. Let $V = \Theta$ and $L(\theta, a) = (\theta - 0.1 - a)^2$.
Let the prior distribution of $\Theta$ be $Beta(1, 1)$ so that the posterior given $X = x$ is
$Beta(x + 1, 11 - x)$ (here $x$ stands for the observed number of successes). If
$X = x > 0$ is observed, then the posterior risk is minimized
at $\delta_0(x) = (x + 1)/12 - 0.1$. However, if $X = 0$ is observed, the posterior risk is
an increasing function of $a$ for $a \in \aleph$, so we would like to choose $\delta_0(0)$ as small as
possible. But the action space is not closed, so there is no smallest possible value.
Any decision rule $\delta$ such that $\delta(x) = \delta_0(x)$ for $x > 0$ is a partial Bayes rule.
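The two cases of Example 3.10 can be reproduced with exact arithmetic. This is a minimal sketch with hypothetical function names; it uses the standard mean and variance of a Beta distribution and treats $x$ as the number of successes in $n = 10$ trials.

```python
from fractions import Fraction as F

def posterior_risk(a, x, n=10):
    """E[(Theta - 0.1 - a)^2 | x] under the Beta(x + 1, n + 1 - x) posterior."""
    alpha, beta = x + 1, n + 1 - x
    mean = F(alpha, alpha + beta)
    var = F(alpha * beta, (alpha + beta) ** 2 * (alpha + beta + 1))
    # Variance plus squared distance from a to the shifted posterior mean.
    return var + (mean - F(1, 10) - a) ** 2

def unconstrained_minimizer(x, n=10):
    """Posterior mean minus 0.1; it lies in the action space (0, 1) only when x > 0."""
    return F(x + 1, n + 2) - F(1, 10)

print(unconstrained_minimizer(3), unconstrained_minimizer(0))
```

For $x = 0$ the unconstrained minimizer is negative, so on $(0, 1)$ the posterior risk increases in $a$ and no minimizing action exists, in line with the example.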

If the posterior risk of a randomized rule is finite, or the loss function is
nonnegative, we can write the posterior risk as

$$r(\delta \mid x) = \int_\aleph \int_{\mathcal{V}} L(v, a)\, dF_{V|X}(v \mid x)\, d\delta(x)(a). \qquad (3.11)$$

In this case, if the inner integral, $h_x(a) = \int_{\mathcal{V}} L(v, a)\, dF_{V|X}(v \mid x)$, considered
as a function of $a$ for fixed $x$, does not achieve a minimum at some value
of $a$, then it is easy to see that (3.11) does not achieve its minimum at any
probability $\delta(x)$. This leads us to state the following result.

Theorem 3.12. If a formal Bayes rule exists with finite posterior risk,
then there is a nonrandomized formal Bayes rule. If at a particular value
of $x$, the posterior risks of nonrandomized rules are unbounded below, then
there exists a randomized rule with posterior risk $-\infty$ at that $x$ (whether
or not there is such a nonrandomized rule).






Proof. Before proving the first part, let $h_x(a) = \int_{\mathcal{V}} L(v, a)\, dF_{V|X}(v \mid x)$,
the posterior risk of a nonrandomized rule with $\delta(x) = a$. Then, if a formal
Bayes rule exists with finite posterior risk, $h_x(a)$ (as a function of $a$) must
be bounded below. Furthermore, if for some $x$, $h_x(a) = \infty$ for all $a$, then
every decision rule has infinite posterior risk at $x$. Hence, we can assume
that $c(x) = \inf_{a \in \aleph} h_x(a) < \infty$ for all $x$.

We will prove the contrapositive of the first part of the theorem, namely
that if, at some value of $x$, there is no nonrandomized formal Bayes rule,
then the formal Bayes rule does not exist. Suppose that there is no
nonrandomized formal Bayes rule at a value $x$. Then there exists no $b \in \aleph$ such
that $h_x(b) = c(x)$. Suppose that $\delta$ is a randomized rule with finite posterior
risk (so $c(x) > -\infty$ also). Let $A_n = \{a : h_x(a) > c(x) + 1/n\}$. Since
$\{a : h_x(a) = c(x)\} = \emptyset$, we can choose $n$ large enough so that $\delta(x)(A_n) > 0$.
It follows that

$$\begin{aligned}
\int h_x(a)\, d\delta(x)(a) &\ge c(x)\, \delta(x)(A_n^C) + \int_{A_n} h_x(a)\, d\delta(x)(a) \\
&\ge c(x)\, \delta(x)(A_n^C) + \delta(x)(A_n) \left( c(x) + \frac{1}{n} \right) \\
&= c(x) + \frac{\delta(x)(A_n)}{n} > c(x).
\end{aligned}$$

Since $c(x) = \inf_{a \in \aleph} h_x(a)$, there exists $a$ such that

$$c(x) < h_x(a) < \int h_x(a)\, d\delta(x)(a).$$

It follows that $\delta$ is not a formal Bayes rule.

For the second part, suppose that $c(x) = \inf_{a \in \aleph} h_x(a) = -\infty$. For each
$k = 1, 2, \ldots$, let $a_k$ be such that $h_x(a_k) < -2^k$ and for $k > 1$, $h_x(a_k) < h_x(a_{k-1})$.
Let $\delta(x)$ assign probability $2^{-k}$ to $a_k$ for each $k$. Then, it is easy
to see that $\delta$ has posterior risk $-\infty$ even if $h_x(a) > -\infty$ for all $a$. □

Although Theorem 3.12 says that there are cases in which one need only 
consider nonrandomized rules in order to find formal Bayes rules, it may 
be that there are still some randomized formal Bayes rules as well. 

Example 3.13. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $Ber(\theta)$. Let $\aleph = \{a_0, a_1\}$ and

$$L(\theta, a) = \begin{cases} 0 & \text{if } (\theta \le 1/2 \text{ and } a = a_0) \text{ or if } (\theta > 1/2 \text{ and } a = a_1), \\ 1 & \text{otherwise.} \end{cases}$$

Let $X = (X_1, \ldots, X_n)$ and suppose that $n$ is even. Let $Y = \sum_{i=1}^n X_i$.
Suppose that the prior is $\eta$ equal to Lebesgue measure. If $Y = y$ successes are
observed in $n$ trials, the posterior is $Beta(y + 1, n - y + 1)$. The posterior risk
for choosing $a = a_0$ is $r(y) = \Pr(\Theta > 1/2 \mid Y = y)$, and the posterior risk for
choosing $a = a_1$ is $\Pr(\Theta \le 1/2 \mid Y = y) = 1 - r(y)$. The formal Bayes rule will
be to choose $a_0$ if $r(y) < 1 - r(y)$ (that is, if $r(y) < 1/2$) and to choose $a_1$ if
$r(y) > 1/2$. If, however, $y = n/2$, the posterior will be $Beta(n/2 + 1, n/2 + 1)$,
which is symmetric about 1/2, so $r(y) = 1/2$. Randomized rules of the following
form are formal Bayes rules:

$$\delta(x) = \begin{cases} \text{probability 1 on } a_0 & \text{if } Y < n/2, \\ \text{probability 1 on } a_1 & \text{if } Y > n/2, \\ \text{probability 1/2 on each} & \text{if } Y = n/2. \end{cases}$$
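The posterior risks in Example 3.13 can be computed exactly. The sketch below relies on a standard identity that is not stated in the text: for integer parameters the Beta CDF is a binomial tail sum, so that $\Pr(\Theta > 1/2 \mid Y = y) = \Pr(W \le y)$ with $W \sim Bin(n + 1, 1/2)$. The function names are hypothetical.

```python
from fractions import Fraction as F
from math import comb

def r(y, n):
    """Pr(Theta > 1/2 | Y = y) under the Beta(y + 1, n - y + 1) posterior.

    Uses Pr(Beta(y + 1, n - y + 1) > 1/2) = Pr(Bin(n + 1, 1/2) <= y).
    """
    return F(sum(comb(n + 1, j) for j in range(y + 1)), 2 ** (n + 1))

def formal_bayes_action(y, n):
    """Action with the smaller posterior risk; 'either' when the two risks tie."""
    risk_a0 = r(y, n)
    if risk_a0 < F(1, 2):
        return "a0"
    if risk_a0 > F(1, 2):
        return "a1"
    return "either"

n = 10
print([formal_bayes_action(y, n) for y in range(n + 1)])
```

The tie at $y = n/2$ is what leaves room for the randomized formal Bayes rules displayed in the example.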

Next, we illustrate the case in which losses are unbounded below. 

Example 3.14. Suppose that $\aleph = \Omega = \mathbb{R}$ and that $L(\theta, a) = -(\theta - a)^2$. In
words, we are trying to choose $a$ as far away from $\Theta$ as possible. Suppose that the
posterior distribution of $\Theta$ has finite variance $\sigma^2$ and mean $\mu$. Then the posterior
risk for a nonrandomized rule $\delta$ is $-(\mu - \delta(x))^2 - \sigma^2$. So, every nonrandomized
rule has finite posterior risk. If, for each $x$, $\delta(x)(\cdot)$ is a randomized rule such that
the distribution over $\aleph$ has infinite variance, then $\delta$ will have posterior risk $-\infty$.
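To see the construction in the second part of Theorem 3.12 at work in Example 3.14, one can put probability $2^{-k}$ on the actions $a_k = \mu + 2^k$. This particular choice of $a_k$ is an assumption made here for illustration; it satisfies $h_x(a_k) < -2^k$. The sketch accumulates the partial sums of the resulting posterior risk.

```python
def h(a, mu, sigma2):
    """Posterior risk of the nonrandomized action a in Example 3.14."""
    return -(mu - a) ** 2 - sigma2

def partial_risk(K, mu=0.0, sigma2=1.0):
    """Sum over k = 1..K of 2^(-k) * h(a_k), with the assumed choice a_k = mu + 2^k."""
    return sum(2.0 ** -k * h(mu + 2.0 ** k, mu, sigma2) for k in range(1, K + 1))

for K in (5, 10, 20):
    print(K, partial_risk(K))
```

Every term of the sum is below $-1$, so the partial sums decrease without bound even though each $h(a_k)$ is finite.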

3.1.3 Elements of Classical Decision Theory 

In the classical paradigm, one conditions on $\Theta = \theta$ (assuming that there
was a parametric family specified) and calculates the risk function

$$R(\theta, \delta) = \int_{\mathcal{X}} \int_{\mathcal{V}} L(v, \delta(x))\, dP_{\theta,V}(v)\, dP_\theta(x).$$

For the case in which $V$ is not $\Theta$, we can define$^1$

$$L(\theta, a) = \int_{\mathcal{V}} L(v, a)\, dP_{\theta,V}(v). \qquad (3.15)$$

In either case ($V = \Theta$ or not), the risk function becomes

$$R(\theta, \delta) = \int_{\mathcal{X}} L(\theta, \delta(x))\, dP_\theta(x).$$

There is usually no way to choose $\delta$ to make $R(\theta, \delta)$ as small as possible
for all $\theta$ simultaneously. One possibility is to choose a probability measure
$\eta$ over $\Omega$ and try to minimize

$$r(\eta, \delta) = \int_\Omega R(\theta, \delta)\, d\eta(\theta),$$

which is called the Bayes risk. Let $\mu_X$ denote the marginal probability
measure over $\mathcal{X}$, namely $\mu_X(A) = \int_\Omega P_\theta(A)\, d\eta(\theta)$. Suppose that $P_\theta \ll \nu$ for
every $\theta$. If $L \ge 0$, we can use Tonelli's theorem A.69, or if $L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)$
is integrable with respect to $\nu \times \eta$, we can use Fubini's theorem A.70 to
conclude that

$$r(\eta, \delta) = \int_{\mathcal{X}} r(\delta \mid x)\, d\mu_X(x).$$

Each $\delta$ that minimizes $r(\eta, \delta)$ is called a Bayes rule, assuming that $r(\eta, \delta)$
is finite. Otherwise no Bayes rule exists. So, we can prove the following.

$^1$ Notice that a predictive decision problem has been replaced, in the classical
setting, by a parametric decision problem with loss $L(\theta, a)$, which does not depend
on the future observable $V$.

Proposition 3.16. If a Bayes rule $\delta$ exists, then there is a partial Bayes
rule that equals $\delta$ a.s. $[\mu_X]$.

3.1.4 Summary

We now summarize the last several definitions in the case where $V = \Theta$.

Definition 3.17. Suppose that we have a decision problem with action
space $\aleph$, parameter space $\Omega$, sample space $\mathcal{X}$, and loss function $L : \Omega \times \aleph \to \mathbb{R}$.
Let $\delta$ be a randomized rule. Then $L(\theta, \delta(x)) = \int_\aleph L(\theta, a)\, d\delta(x)(a)$. The
posterior risk of $\delta$ is

$$r(\delta \mid x) = \int_\Omega L(\theta, \delta(x))\, d\mu_{\Theta|X}(\theta \mid x).$$

Let $A$ be the set of all $x$ for which there exists $a_x$ that achieves a finite
minimum posterior risk. Then a decision rule $\delta_0$ such that $\delta_0(x) = a_x$ for all
$x \in A$ is called a partial Bayes rule. If $A = \mathcal{X}$, then a partial Bayes rule
is called a formal Bayes rule. The risk function of a rule $\delta$ is $R(\theta, \delta) =
\int_{\mathcal{X}} L(\theta, \delta(x))\, dP_\theta(x)$. If $\eta$ is a prior distribution for $\Theta$, the Bayes risk of $\delta$
with respect to $\eta$ is $r(\eta, \delta) = \int_\Omega R(\theta, \delta)\, d\eta(\theta)$. If there is a $\delta$ that minimizes
this quantity at a finite value, then that rule is called the Bayes rule with
respect to $\eta$.

3.2 Classical Decision Theory
3.2.1 The Role of Sufficient Statistics

As defined, decision rules can be arbitrary functions of the data. We learned
in Chapter 2 that all we needed from the data were sufficient statistics,
so decision rules should need to be functions of sufficient statistics only.
Of course, formal Bayes rules will only be functions of
the sufficient statistic, since the posterior distribution is a function of the
sufficient statistic. The next theorem says that if a choice of decision rules
will be based solely on the risk functions, then even in classical statistics,
decision rules need only be functions of sufficient statistics.





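Theorem 3.18.$^2$ Let $\delta_0$ be a randomized rule and $T$ be a sufficient statistic.
Then there exists a rule $\delta_1$ that is a function of the sufficient statistic and
has the same risk function.

Proof. For $A \in \alpha$, define

$$\delta_1(t)(A) = E(\delta_0(X)(A) \mid T = t).$$

(Since $T$ is sufficient, this expectation does not depend on $\theta$.) It follows
easily that for any $\delta_0(x)$ integrable function $h : \aleph \to \mathbb{R}$,

$$E\left( \int h(a)\, d\delta_0(X)(a) \,\Big|\, T = t \right) = \int h(a)\, d\delta_1(t)(a). \qquad (3.19)$$

(Just check the equation for indicators, simple functions, nonnegative
functions, then integrable functions.) Then,

$$\begin{aligned}
R(\theta, \delta_1) &= \int_{\mathcal{X}} L(\theta, \delta_1(T(x)))\, dP_\theta(x) = \int_{\mathcal{X}} \int_\aleph L(\theta, a)\, d\delta_1(T(x))(a)\, dP_\theta(x), \\
R(\theta, \delta_0) &= \int_{\mathcal{X}} \int_\aleph L(\theta, a)\, d\delta_0(x)(a)\, dP_\theta(x).
\end{aligned}$$

It follows from (3.19) that

$$\int_\aleph L(\theta, a)\, d\delta_1(T(x))(a) = E\left( \int_\aleph L(\theta, a)\, d\delta_0(X)(a) \,\Big|\, T = T(x) \right).$$

Now, use the law of total probability B.70 to write $R(\theta, \delta_1)$ as

$$\begin{aligned}
\int_{\mathcal{X}} \int_\aleph L(\theta, a)\, d\delta_1(T(x))(a)\, dP_\theta(x)
&= E_\theta\left( E\left[ \int_\aleph L(\theta, a)\, d\delta_0(X)(a) \,\Big|\, T \right] \right) = E_\theta\left( \int_\aleph L(\theta, a)\, d\delta_0(X)(a) \right) \\
&= \int_{\mathcal{X}} \int_\aleph L(\theta, a)\, d\delta_0(x)(a)\, dP_\theta(x) = R(\theta, \delta_0). \qquad \Box
\end{aligned}$$

Note that if $\delta_0$ in Theorem 3.18 is nonrandomized, then $\delta_1$ will still be
randomized if $T$ is not one-to-one. (See Problem 8 on page 209.)
There are cases in which nonrandomized rules are all we need.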

Theorem 3.20.$^3$ Suppose that $\aleph$ is a convex subset of $\mathbb{R}^m$ and that for all
$\theta \in \Omega$, $L(\theta, a)$ is a convex function of $a$. Let $\delta$ be a randomized rule and let
$B \subseteq \mathcal{X}$ be the set of all $x$ such that $\int_\aleph |a|\, d\delta(x)(a) < \infty$. Let the mean of
the distribution $\delta$, considered as a nonrandomized rule, be

$$\delta_0(x) = \int_\aleph a\, d\delta(x)(a), \quad \text{for } x \in B.$$

Then $L(\theta, \delta_0(x)) \le L(\theta, \delta(x))$ for all $x \in B$ and for all $\theta$.

$^2$ This theorem is used in the proof of the Rao-Blackwell theorem 3.22.
$^3$ This theorem is used in the proof of Theorem 3.22.

Proof. Since $\aleph$ is convex, Theorem B.17 says that $\delta_0(x) \in \aleph$ for all $x \in B$.
It follows that

$$L(\theta, \delta_0(x)) = L\left( \theta, \int_\aleph a\, d\delta(x)(a) \right) \le \int_\aleph L(\theta, a)\, d\delta(x)(a) = L(\theta, \delta(x)),$$

for all $x \in B$. The inequality follows from Jensen's inequality B.17. □
If $B = \mathcal{X}$ in Theorem 3.20, then the posterior risk for the nonrandomized
rule $\delta_0$ will be no larger than that for the randomized rule $\delta$. If $P_\theta(B) = 1$
for all $\theta$, then the risk function of the nonrandomized rule will be no larger
than that of the randomized rule.

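Example 3.21. Suppose that $P_\theta$ says that $X_1, \ldots, X_n$ are IID $Ber(\theta)$. Let $X =
(X_1, \ldots, X_n)$, $\aleph = \Omega = [0, 1]$, and $L(\theta, a) = (\theta - a)^2$. Let $y = \sum_{i=1}^n x_i$, and set

$$\delta(y) = \begin{cases} \dfrac{y}{n} & \text{with probability } \tfrac{1}{2}, \\[1ex] \dfrac{y + 1}{n + 2} & \text{with probability } \tfrac{1}{2}. \end{cases}$$

This is like flipping a coin between the proportion of successes and the posterior
mean from a $U(0, 1)$ prior. Then

$$\begin{aligned}
\delta_0(y) &= \frac{y}{2n} + \frac{y + 1}{2n + 4}, \\
L(\theta, \delta(y)) &= \frac{1}{2} \left( \theta - \frac{y}{n} \right)^2 + \frac{1}{2} \left( \theta - \frac{y + 1}{n + 2} \right)^2 \\
&= \theta^2 - \theta \left( \frac{y}{n} + \frac{y + 1}{n + 2} \right) + \frac{1}{2} \left( \frac{y^2}{n^2} + \frac{(y + 1)^2}{(n + 2)^2} \right), \\
L(\theta, \delta_0(y)) &= \theta^2 - \theta \left( \frac{y}{n} + \frac{y + 1}{n + 2} \right) + \frac{1}{4} \left( \frac{y}{n} + \frac{y + 1}{n + 2} \right)^2.
\end{aligned}$$

Since $(x + z)^2/2 < x^2 + z^2$ for $x \ne z$, it follows that $L(\theta, \delta_0(y)) < L(\theta, \delta(y))$
whenever $y/n \ne (y + 1)/(n + 2)$, that is, whenever $y \ne n/2$; the two losses are equal when $y = n/2$.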
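The comparison in Example 3.21 reduces to the identity $L(\theta, \delta(y)) - L(\theta, \delta_0(y)) = \frac{1}{4}\left(\frac{y}{n} - \frac{y+1}{n+2}\right)^2$, which follows from the two expansions above. A minimal sketch that checks it in exact arithmetic (function names are hypothetical):

```python
from fractions import Fraction as F

def loss(theta, a):
    """Squared-error loss of Example 3.21."""
    return (theta - a) ** 2

def randomized_loss(theta, y, n):
    """L(theta, delta(y)): average of the losses of the two equally likely actions."""
    return F(1, 2) * loss(theta, F(y, n)) + F(1, 2) * loss(theta, F(y + 1, n + 2))

def mean_rule(y, n):
    """delta_0(y): the mean of the randomized rule's distribution over actions."""
    return F(1, 2) * F(y, n) + F(1, 2) * F(y + 1, n + 2)

def gap(theta, y, n):
    """Loss of the randomized rule minus loss of its mean; never negative."""
    return randomized_loss(theta, y, n) - loss(theta, mean_rule(y, n))

n = 10
print([gap(F(1, 3), y, n) for y in range(n + 1)])
```

The gap does not depend on $\theta$ and vanishes only at $y = n/2$, where the two actions coincide.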

The theory of hypothesis testing (see Chapter 4) is one in which $\aleph$ is
not a convex set, and indeed, randomized rules figure prominently in the
classical theory of hypothesis testing.

Theorem 3.22 (Rao-Blackwell theorem).$^4$ Suppose that $\aleph$ is a convex
subset of $\mathbb{R}^m$ and that for all $\theta \in \Omega$, $L(\theta, a)$ is a convex function
of $a$. Suppose also that $T$ is sufficient and $\delta_0$ is nonrandomized such that
$E_\theta(\|\delta_0(X)\|) < \infty$. Define

$$\delta_1(t) = E(\delta_0(X) \mid T = t).$$

Then $R(\theta, \delta_1) \le R(\theta, \delta_0)$ for all $\theta$.

$^4$ This theorem originated with Rao (1945) and Blackwell (1947).

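Proof. Consider $\delta_0$ as the randomized rule $\delta_3(x)(A) = I_A(\delta_0(x))$, for
$A \in \alpha$. For $A \in \alpha$, let

$$\delta_4(t)(A) = E[\delta_3(X)(A) \mid T = t], \qquad \delta_2(t) = \int_\aleph a\, d\delta_4(t)(a).$$

By Theorems 3.20 and 3.18,

$$R(\theta, \delta_2) \le R(\theta, \delta_4) = R(\theta, \delta_3) = R(\theta, \delta_0).$$

All that remains is to show that $\delta_2 = \delta_1$. Using the law of total
probability B.70, we can write

$$\delta_2(t) = \int_\aleph a\, d\delta_4(t)(a) = \int_{\mathcal{X}} \int_\aleph a\, d\delta_3(x)(a)\, dF_{X|T}(x \mid t),$$

where $F_{X|T}$ is the conditional distribution function of $X$ given $T$. Since
$\delta_3(x)(\cdot)$ is a point mass at $\delta_0(x)$, we get

$$\delta_2(t) = \int_{\mathcal{X}} \delta_0(x)\, dF_{X|T}(x \mid t) = E(\delta_0(X) \mid T = t) = \delta_1(t). \qquad \Box$$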

Example 3.23. Suppose that $P_\theta$ says that $X_1, \ldots, X_n$ are IID $N(\theta, 1)$. Let $X =
(X_1, \ldots, X_n)$. Let $\aleph = [0, 1]$ and

$$L(\theta, a) = (a - \Phi(c - \theta))^2,$$

for some fixed $c \in \mathbb{R}$. A naive decision rule is $\delta_0(x) = \sum_{i=1}^n I_{(-\infty, c]}(x_i)/n$. But
$T = \bar{X}$ is sufficient and $\delta_0$ is not a function of $T$. Since $\aleph$ is convex and the loss
function is a convex function of $a$, we should calculate

$$E(\delta_0(X) \mid T = t) = \frac{1}{n} \sum_{i=1}^n E(I_{(-\infty, c]}(X_i) \mid T = t) = \Pr(X_1 \le c \mid T = t) = \Phi\left( \frac{c - t}{\sqrt{[n - 1]/n}} \right),$$

since the distribution of $X_1$ given $T = t$ is $N(t, [n - 1]/n)$.
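The improvement promised by the Rao-Blackwell theorem in Example 3.23 can be seen in a small simulation. The following is a rough Monte Carlo sketch, not an exact risk calculation; the values of $\theta$, $c$, $n$, the number of replications, and the seed are arbitrary choices, and the function names are hypothetical.

```python
import math
import random

def Phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def naive(xs, c):
    """delta_0: proportion of observations that are at most c."""
    return sum(x <= c for x in xs) / len(xs)

def rao_blackwell(xs, c):
    """delta_1(t) = Phi((c - t) / sqrt((n - 1) / n)) with t the sample mean; needs n >= 2."""
    n = len(xs)
    t = sum(xs) / n
    return Phi((c - t) / math.sqrt((n - 1) / n))

def mse(rule, theta, c, n, reps, seed=0):
    """Monte Carlo estimate of the risk R(theta, rule) under the loss of Example 3.23."""
    rng = random.Random(seed)
    target = Phi(c - theta)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (rule(xs, c) - target) ** 2
    return total / reps

print(mse(naive, 0.3, 1.0, 10, 4000), mse(rao_blackwell, 0.3, 1.0, 10, 4000))
```

Both calls use the same seed, so the two rules are evaluated on the same simulated samples.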

3.2.2 Admissibility 

The Rao-Blackwell theorem 3.22 tells us that under some conditions, the 
risk function of one decision rule is no larger than that of another. Similarly, 
Theorem 3.20 tells us that under some conditions, the loss incurred from 
one decision rule is no larger than that from another. These theorems have 
a common theme. That is, sometimes we can tell that one decision rule is
better than another no matter what $\Theta$ equals.





Definition 3.24. A decision rule $\delta$ is inadmissible if there is another
decision rule $\delta_1$ such that $R(\theta, \delta_1) \le R(\theta, \delta)$ for all $\theta$ with strict inequality for
some $\theta$. If there is such a $\delta_1$, we say that $\delta_1$ dominates $\delta$. If there is no such
$\delta_1$, then we say $\delta$ is admissible.

Example 3.25. Suppose that $P_\theta$ says that $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$, where
$\theta = (\mu, \sigma)$. Let $X = (X_1, \ldots, X_n)$, $\aleph = [0, \infty)$, and $L(\theta, a) = (a - \sigma^2)^2$. Define

i=l i=l 

Then it can be shown that $\delta_1$ dominates $\delta$ (see Problem 11 on page 210).
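As an illustration of dominance under the loss $(a - \sigma^2)^2$, consider rules of the form $c \sum_{i=1}^n (x_i - \bar{x})^2$. The multipliers compared below, $1/(n-1)$, $1/n$, and $1/(n+1)$, are chosen here for illustration and need not be the pair of rules intended in Example 3.25. The sketch uses the standard facts $E(S) = (n-1)\sigma^2$ and $E(S^2) = (n^2 - 1)\sigma^4$ for $S = \sum_{i=1}^n (X_i - \bar{X})^2$, which give the risk in closed form.

```python
from fractions import Fraction as F

def risk(c, n, sigma2=F(1)):
    """Exact risk E[(c S - sigma^2)^2] for S = sum (X_i - Xbar)^2 under N(mu, sigma^2).

    Expanding the square with E(S) = (n - 1) sigma^2 and E(S^2) = (n^2 - 1) sigma^4
    gives sigma^4 * (c^2 (n^2 - 1) - 2 c (n - 1) + 1); it does not depend on mu.
    """
    return sigma2 ** 2 * (c * c * (n * n - 1) - 2 * c * (n - 1) + 1)

n = 5
for c in (F(1, n - 1), F(1, n), F(1, n + 1)):
    print(c, risk(c, n))
```

Because each risk is a fixed multiple of $\sigma^4$, a smaller multiple at one value of $\sigma$ means a smaller risk at every $\theta$, which is exactly the dominance relation of Definition 3.24.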

The criterion of admissibility may seem too severe if some values of $\theta$ are
deemed to be virtually impossible.

Definition 3.26. Let $\lambda$ be a measure on $(\Omega, \tau)$ and let $\delta$ be a decision rule.
For every decision rule $\delta_1$, let $A_{\delta_1} = \{\theta : R(\theta, \delta_1) < R(\theta, \delta)\}$. Suppose that
for every decision rule $\delta_1$, if $R(\theta, \delta_1) \le R(\theta, \delta)$ a.e. $[\lambda]$ then $\lambda(A_{\delta_1}) = 0$.
Then $\delta$ is $\lambda$-admissible.

A Bayes rule with respect to a probability measure $\lambda$ is $\lambda$-admissible.

Theorem 3.27.$^5$ Suppose that $\lambda$ is a probability and $\delta$ is a Bayes rule with
respect to $\lambda$. Then $\delta$ is $\lambda$-admissible.

Proof. Let $\delta_1$ be a decision rule. If $R(\theta, \delta_1) \le R(\theta, \delta)$ a.s. $[\lambda]$ with strict
inequality for all $\theta \in A$ with $\lambda(A) > 0$, then

$$\int_\Omega R(\theta, \delta_1)\, d\lambda(\theta) < \int_\Omega R(\theta, \delta)\, d\lambda(\theta),$$

which contradicts $\delta$ being a Bayes rule with respect to $\lambda$. □
A $\lambda$-admissible rule will be admissible if $\lambda$ is a probability that is spread
out appropriately. Theorems 3.28, 3.29, 3.31, and 3.32 say this in different
ways.

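Theorem 3.28. If $\Omega$ is discrete, $\lambda$ is a probability that gives positive
probability to each element of $\Omega$, and $\delta$ is Bayes with respect to $\lambda$, then $\delta$ is
admissible.

Proof. Suppose that $\delta_1$ dominates $\delta$. Then $R(\theta, \delta_1) \le R(\theta, \delta)$ for all $\theta$,
and for some $\theta_0$ we have $R(\theta_0, \delta_1) < R(\theta_0, \delta)$. It follows that

$$r(\lambda, \delta_1) = \sum_{\text{All } \theta} \lambda(\{\theta\}) R(\theta, \delta_1) < \sum_{\text{All } \theta} \lambda(\{\theta\}) R(\theta, \delta) = r(\lambda, \delta),$$

since $\lambda(\{\theta_0\}) > 0$. This contradicts that $\delta$ is Bayes. □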



5 This theorem is used in the proof of Theorem 3.31. 






Theorem 3.29. If every Bayes rule with respect to a prior $\lambda$ has the same
risk function, then they are all admissible.

Proof. Let $\delta$ be a Bayes rule with respect to the prior $\lambda$, and let $g(\theta)$ be
the risk function of every such Bayes rule. Suppose that $\delta_0$ dominates $\delta$.
Then $R(\theta, \delta_0) \le g(\theta)$ for all $\theta$ with strict inequality for some $\theta$. But then
$\int_\Omega R(\theta, \delta_0)\, d\lambda(\theta) \le \int_\Omega g(\theta)\, d\lambda(\theta)$. Since $\delta$ is a Bayes rule, the inequality
must be an equality. This means that $\delta_0$ is also a Bayes rule, hence it has
risk function $g(\theta)$, which is a contradiction. □
Here is an example in which the condition of Theorem 3.29 does not
hold.

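Example 3.30. Let $\Omega = (0, \infty)$ and $\aleph = [0, \infty)$. Let $L(\theta, a) = (\theta - a)^2$. Let
$X \sim U(0, \theta)$ given $\Theta = \theta$, and let $\lambda$ be the $U(0, c)$ distribution for $c > 0$. Then,
for $x < c$, $\Theta \mid X = x$ has density $[\theta \log(c/x)]^{-1} I_{(x, c)}(\theta)$. The formal Bayes rules
are of the form

$$\delta(x) = \begin{cases} \dfrac{c - x}{\log(c/x)} & \text{if } x < c, \\[1ex] \text{arbitrary} & \text{if } x \ge c. \end{cases}$$

Clearly, $R(\theta, \delta)$ for $\theta > c$ will depend on the arbitrary part of the definition of $\delta$.
For example, $\delta_0(x) = c$ for $x \ge c$ will have a larger risk function than $\delta_1(x) = x$
for $x \ge c$, even if $\delta_1(x) = \delta_0(x)$ for $x < c$.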

The following theorem may apply when the parameter space is an open
subset of $\mathbb{R}^k$ or when it is the closure of an open set.

Theorem 3.31. Let $\Omega$ be a subset of $\mathbb{R}^k$ such that every neighborhood of
every point in $\Omega$ intersects the interior of $\Omega$. Let $\lambda$ be a measure on $(\Omega, \tau)$
such that Lebesgue measure on $\Omega$ is absolutely continuous with respect to $\lambda$.
Suppose that $\delta_0$ is $\lambda$-admissible and that it has finite risk function. Suppose
that $R(\theta, \delta)$ is continuous in $\theta$ for all $\delta$ with finite risk function. Then $\delta_0$
is admissible.

Proof. If $\delta_0$ were inadmissible, then there would be $\delta_1$ such that $R(\theta, \delta_1) \le
R(\theta, \delta_0)$ for all $\theta$ with strict inequality for some $\theta_0$. By continuity of risk
functions, $R(\theta, \delta_1) < R(\theta, \delta_0)$ for all $\theta$ in some neighborhood $N$ of $\theta_0$
which intersects the interior of $\Omega$. Since Lebesgue measure is absolutely
continuous with respect to $\lambda$, $\lambda(N) > 0$. Hence $\delta_0$ is not $\lambda$-admissible. This
contradiction proves the result. □
Consider an exponential family with natural parameter space SI contain- 

function. The risk function of each decision rule with finite variance for all 
$\theta$ will be continuous in $\theta$ according to Theorem 2.64. If $\delta$ is a Bayes rule
with respect to a prior $\lambda$ that has a strictly positive density with respect to
Lebesgue measure, then $\delta$ is $\lambda$-admissible by Theorem 3.27. Since the natural
parameter space is convex, it satisfies the conditions of Theorem 3.31
and $\delta$ is admissible.

The following theorem says that with a strictly convex loss function 
every Bayes rule is admissible. 






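Theorem 3.32. Suppose that $\aleph$ is a convex subset of $\mathbb{R}^m$ and that all $P_\theta$
are absolutely continuous with respect to each other. If $L(\theta, \cdot)$ is strictly
convex for all $\theta$ and $\delta_0$ is $\lambda$-admissible for some $\lambda$, then $\delta_0$ is admissible.

Proof. If $\delta_0$ were inadmissible, then there would be $\delta_1$ such that $R(\theta, \delta_1) \le
R(\theta, \delta_0)$ for all $\theta$ with strict inequality for some $\theta_0$. Define $\delta_2(x) = [\delta_0(x) +
\delta_1(x)]/2$. Then, for every $\theta$,

$$\begin{aligned}
R(\theta, \delta_2) &= \int_{\mathcal{X}} L\left( \theta, \frac{\delta_0(x) + \delta_1(x)}{2} \right) dP_\theta(x) \\
&\le \frac{1}{2} \int_{\mathcal{X}} [L(\theta, \delta_0(x)) + L(\theta, \delta_1(x))]\, dP_\theta(x) \\
&= \frac{1}{2} R(\theta, \delta_0) + \frac{1}{2} R(\theta, \delta_1) \le R(\theta, \delta_0).
\end{aligned}$$

The first inequality above will be strict unless $P'_\theta(\delta_1(X) = \delta_0(X)) = 1$.
Since all $P_\theta$ are absolutely continuous with respect to each other, it follows
that $P'_\theta(\delta_1(X) = \delta_0(X)) = 1$ for one $\theta$ if and only if $P'_\theta(\delta_1(X) = \delta_0(X)) = 1$
for all $\theta$. Hence the first inequality will be strict unless the distribution of
$\delta_1(X)$ is the same as the distribution of $\delta_0(X)$ given $\Theta = \theta$ for all $\theta$. In this
case, $\delta_1$ could not dominate $\delta_0$, hence the inequality must be strict for all
$\theta$. This would imply that $\delta_0$ is not $\lambda$-admissible, no matter what $\lambda$ is. □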

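Example 3.33. Suppose that $P_\theta$ says that $X_1, \ldots, X_n$ are IID $Ber(\theta)$, and let
$X = (X_1, \ldots, X_n)$. Suppose that $\aleph = [0, 1]$ and that the loss is $L(\theta, a) = (\theta -
a)^2/[\theta(1 - \theta)]$. Define $Y = \sum_{i=1}^n X_i$, and let the prior $\lambda$ be Lebesgue measure
on $[0, 1]$. The posterior given $X = x$ would be $Beta(y + 1, n - y + 1)$, where
$y = \sum_{i=1}^n x_i$. Then the posterior risk

$$\frac{\Gamma(n + 2)}{\Gamma(y + 1)\Gamma(n - y + 1)} \int_0^1 (\theta - a)^2\, \theta^{y - 1} (1 - \theta)^{n - y - 1}\, d\theta$$

is minimized at $a = y/n$ for all $x$ and all $n > 0$. So, $\delta_0(x) = y/n$ is a Bayes rule
with respect to $\lambda$ and it is admissible by Theorem 3.32.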
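For $1 \le y \le n - 1$ the minimization in Example 3.33 can be checked with Beta functions. In the sketch below the posterior risk is computed up to its positive normalizing constant as $\int_0^1 (\theta - a)^2 \theta^{y-1} (1 - \theta)^{n-y-1}\, d\theta$, which follows from multiplying the loss by the $Beta(y + 1, n - y + 1)$ density; the helper names are hypothetical.

```python
from fractions import Fraction as F
from math import factorial

def beta_fn(p, q):
    """Beta function B(p, q) for positive integers p and q."""
    return F(factorial(p - 1) * factorial(q - 1), factorial(p + q - 1))

def weighted_risk(a, y, n):
    """Integral of (theta - a)^2 theta^(y-1) (1-theta)^(n-y-1) over (0, 1), for 1 <= y <= n - 1."""
    return (beta_fn(y + 2, n - y)
            - 2 * a * beta_fn(y + 1, n - y)
            + a * a * beta_fn(y, n - y))

def minimizer(y, n):
    """Vertex of the quadratic in a: B(y + 1, n - y) / B(y, n - y)."""
    return beta_fn(y + 1, n - y) / beta_fn(y, n - y)

print(minimizer(3, 10), weighted_risk(F(3, 10), 3, 10))
```

The minimizer is the sample proportion $y/n$ rather than the posterior mean $(y + 1)/(n + 2)$, because the weight $1/[\theta(1 - \theta)]$ in the loss changes which action is optimal.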

Theorem 3.32 even applies to measures $\lambda$ that put 0 mass on large portions of
the parameter space. See Problem 17 on page 210 for examples.

The concept of $\lambda$-admissibility did not require that $\lambda$ be a probability
measure. It is common to try to find "Bayes" rules with respect to
nonprobability measures.

Definition 3.34. Let $dP_\theta/d\nu(x) = f_{X|\Theta}(x \mid \theta)$. Suppose that $\lambda$ is a measure
on $(\Omega, \tau)$ and that for every $x$ there exists $\delta(x)$ such that

$$\int_\Omega L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta) = \min_{a} \int_\Omega L(\theta, a) f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta). \qquad (3.35)$$

The rule $\delta$ is called a generalized Bayes rule with respect to $\lambda$.






If

$$0 < c = \int_\Omega f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta) < \infty, \qquad (3.36)$$

then, after observing $x$, one can pretend that the "prior" distribution of $\Theta$
has density $f_{X|\Theta}(x \mid \theta)/c$ with respect to $\lambda$ and that there are no data. If a
formal Bayes rule exists in this problem, it is a generalized Bayes rule. For
this reason, generalized Bayes rules with respect to $\lambda$ are also called formal
Bayes rules with respect to $\lambda$.

Example 3.37. Suppose that $P_\theta$ says that $X_1, \ldots, X_n$ are IID $U(0, \theta)$. Let $X =
(X_1, \ldots, X_n)$, $\Omega = (0, \infty)$, and $\aleph = [0, \infty)$. Let $\lambda$ be Lebesgue measure on $(0, \infty)$
and $L(\theta, a) = (\theta - a)^2$. Then

$$f_{X|\Theta}(x \mid \theta) = \theta^{-n} I_{(0, \theta)}(\max x_i) = \theta^{-n} I_{(\max x_i, \infty)}(\theta).$$

We get $c$ in (3.36) equal to $[(n - 1)(\max x_i)^{n - 1}]^{-1}$. So, we could invent the "prior"
density

$$(n - 1)(\max x_i)^{n - 1}\, \theta^{-n} I_{(\max x_i, \infty)}(\theta).$$

The Bayes rule with respect to this prior is the mean, which is $\delta(x) = (n -
1) \max x_i/(n - 2)$. This is a generalized Bayes rule.
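The mean of the invented "prior" in Example 3.37 can be confirmed numerically. The sketch below assumes $n \ge 3$ (so that the mean exists) and substitutes $u = \max x_i/\theta$, under which the mean becomes $(n - 1) \max x_i \int_0^1 u^{n-3}\, du$; a midpoint rule handles the remaining integral. The function names and the sample data are hypothetical.

```python
def generalized_bayes_estimate(xs):
    """Closed form (n - 1) max(x) / (n - 2) from Example 3.37; requires n >= 3."""
    n, m = len(xs), max(xs)
    if n < 3:
        raise ValueError("the mean of the invented prior requires n >= 3")
    return (n - 1) * m / (n - 2)

def posterior_mean_numeric(xs, grid=10000):
    """Midpoint-rule value of (n - 1) * m * integral_0^1 u^(n - 3) du, with m = max(x)."""
    n, m = len(xs), max(xs)
    h = 1.0 / grid
    integral = sum(((i + 0.5) * h) ** (n - 3) for i in range(grid)) * h
    return (n - 1) * m * integral

xs = [0.2, 1.5, 0.9, 1.1]  # illustrative data only
print(generalized_bayes_estimate(xs), posterior_mean_numeric(xs))
```

Note that the estimate always exceeds $\max x_i$, as it must, since the invented prior puts all of its mass above the largest observation.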

It sometimes happens that the integral with respect to $\lambda$ of the risk
function of a generalized Bayes rule with respect to $\lambda$ is finite. In such
cases, there is an analog to Theorem 3.31.

Theorem 3.38. If $\Omega$ is a subset of $\mathbb{R}^k$ such that every neighborhood of
every point in $\Omega$ intersects the interior of $\Omega$, $R(\theta, \delta)$ is continuous in $\theta$ for
all $\delta$, Lebesgue measure on $\Omega$ is absolutely continuous with respect to $\lambda$, $\delta_0$
is a generalized Bayes rule with respect to $\lambda$, and $L(\theta, \delta_0(x)) f_{X|\Theta}(x \mid \theta)$ is
$\nu \times \lambda$ integrable, then $\delta_0$ is $\lambda$-admissible and admissible.

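Proof. All we need to show is that $\delta_0$ is $\lambda$-admissible and then apply
Theorem 3.31. For each decision rule $\delta$, $R(\theta, \delta) = \int_{\mathcal{X}} L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)\, d\nu(x)$.
If $L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)$ is $\nu \times \lambda$ integrable, then

$$\begin{aligned}
\int_\Omega R(\theta, \delta)\, d\lambda(\theta) &= \int_\Omega \int_{\mathcal{X}} L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)\, d\nu(x)\, d\lambda(\theta) \\
&= \int_{\mathcal{X}} \int_\Omega L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta)\, d\nu(x),
\end{aligned}$$

where the last equality follows from Fubini's theorem A.70. If $\delta_1$ is any
other rule, then

$$\int_\Omega L(\theta, \delta_0(x)) f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta) \le \int_\Omega L(\theta, \delta_1(x)) f_{X|\Theta}(x \mid \theta)\, d\lambda(\theta),$$

for all $x$, since $\delta_0$ is a generalized Bayes rule with respect to $\lambda$. Hence,

$$\int_\Omega R(\theta, \delta_0)\, d\lambda(\theta) \le \int_\Omega R(\theta, \delta_1)\, d\lambda(\theta).$$

So, it cannot be the case that $R(\theta, \delta_1) \le R(\theta, \delta_0)$ for all $\theta$ with strict
inequality for $\theta \in A$ with $\lambda(A) > 0$. □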

Example 3.39. Suppose that $X_1, \ldots, X_n$ are IID $Ber(\theta)$ given $\Theta = \theta$, and let
$X = (X_1, \ldots, X_n)$. Let $\aleph = [0, 1]$ and let the loss be $L(\theta, a) = (\theta - a)^2$. Define
$Y = \sum_{i=1}^n X_i$ and let $\lambda$ have Radon-Nikodym derivative $1/[\theta(1 - \theta)]$ with respect
to Lebesgue measure on $(0, 1)$. The posterior given $X = x$ would be $Beta(y, n - y)$,
where $y = \sum_{i=1}^n x_i$, unless $y = 0$ or $y = n$. For $1 \le y \le n - 1$, the generalized
Bayes rule is $\delta(x) = y/n$. For $y = 0$ or $y = n$, the only values of $\delta(x)$ which make
(3.35) finite are $\delta(x) = y/n$. So $\delta$ is a generalized Bayes rule with respect to $\lambda$.
Now $L(\theta, \delta(x)) f_{X|\Theta}(x \mid \theta) = (\theta - y/n)^2 \theta^y (1 - \theta)^{n - y}$, which has integral $1/n$ with
respect to counting measure times $\lambda$. Hence, $\delta$ is $\lambda$-admissible and admissible.
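The value $1/n$ quoted in Example 3.39 can be verified exactly for small $n$. The sketch below groups the $2^n$ sample points by $y$ (there are $\binom{n}{y}$ of each), divides the integrand by $\theta(1 - \theta)$ to account for $\lambda$, and evaluates the resulting integrals as Beta functions. The helper names are hypothetical.

```python
from fractions import Fraction as F
from math import comb, factorial

def beta_fn(p, q):
    """Beta function B(p, q) for positive integers p and q."""
    return F(factorial(p - 1) * factorial(q - 1), factorial(p + q - 1))

def integrated_risk(n):
    """Sum over x of the integral of (theta - y/n)^2 theta^(y-1) (1-theta)^(n-y-1) d theta."""
    total = F(0)
    for y in range(n + 1):
        if y == 0:
            inner = beta_fn(2, n)    # integrand reduces to theta (1 - theta)^(n - 1)
        elif y == n:
            inner = beta_fn(n, 2)    # integrand reduces to theta^(n - 1) (1 - theta)
        else:
            d = F(y, n)
            inner = (beta_fn(y + 2, n - y)
                     - 2 * d * beta_fn(y + 1, n - y)
                     + d * d * beta_fn(y, n - y))
        total += comb(n, y) * inner
    return total

print([integrated_risk(n) for n in range(1, 6)])
```

The same value follows analytically: summing over $x$ first gives the risk $\theta(1 - \theta)/n$ of the sample proportion, and dividing by $\theta(1 - \theta)$ leaves the constant $1/n$ to integrate over $(0, 1)$.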

Sometimes the integral of the risk function with respect to an infinite 
measure is not finite. For this reason, Blyth (1951) proved the following 
theorem, which makes use of a sequence of generalized Bayes rules to con- 
clude that a rule is admissible. 

Theorem 3.40.$^6$ Let $\delta$ be a decision rule. Let $\{\lambda_n\}_{n=1}^\infty$ be a sequence of
measures on $(\Omega, \tau)$ such that a generalized Bayes rule $\delta_n$ with respect to $\lambda_n$
exists for every $n$ with

$$r(\lambda_n, \delta_n) = \int_\Omega R(\theta, \delta_n)\, d\lambda_n(\theta),$$

$$\lim_{n \to \infty} [r(\lambda_n, \delta) - r(\lambda_n, \delta_n)] = 0. \qquad (3.41)$$

Suppose that either of the following conditions holds:

• All $P_\theta$ are absolutely continuous with respect to each other; $\aleph$ is a
convex set; $L(\theta, a)$ is strictly convex in $a$ for all $\theta$; and there exist $c > 0$,
a set $C$, and a measure $\lambda$ such that $\lambda_n \ll \lambda$ and $d\lambda_n/d\lambda(\theta) \ge c$ for
$\theta \in C$ with $\lambda(C) > 0$.

• Every neighborhood of every point in $\Omega$ intersects the interior of $\Omega$, for
every open subset $C \subseteq \Omega$ there exists a number $c > 0$ such that $\lambda_n(C) \ge c$
for all $n$, and the risk function of every decision rule is continuous
in $\theta$.

Then $\delta$ is admissible.

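Proof. Suppose that $\delta$ is inadmissible. Then there is $\delta'$ such that $R(\theta, \delta') \le
R(\theta, \delta)$ for all $\theta$ and $R(\theta_0, \delta') < R(\theta_0, \delta)$.

If the first condition holds, set $\delta'' = (\delta + \delta')/2$ and get that $L(\theta, \delta''(x)) <
[L(\theta, \delta(x)) + L(\theta, \delta'(x))]/2$ for all $\theta$ and all $x$ for which $\delta(x) \ne \delta'(x)$. Since
$P'_{\theta_0}(\delta(X) = \delta'(X)) < 1$ and all $P_\theta$ are absolutely continuous with respect to
each other, we have $P'_\theta(\delta(X) = \delta'(X)) < 1$ for all $\theta$ and $R(\theta, \delta'') < R(\theta, \delta)$
for all $\theta$. So, for each $n$,

$$\begin{aligned}
r(\lambda_n, \delta) - r(\lambda_n, \delta_n) &\ge r(\lambda_n, \delta) - r(\lambda_n, \delta'') \ge \int_C [R(\theta, \delta) - R(\theta, \delta'')]\, d\lambda_n(\theta) \\
&\ge c \int_C [R(\theta, \delta) - R(\theta, \delta'')]\, d\lambda(\theta) > 0.
\end{aligned}$$

This contradicts (3.41).

If the second condition holds, there exists $\epsilon > 0$ and open $C \subseteq \Omega$ such
that $R(\theta, \delta') < R(\theta, \delta) - \epsilon$ for all $\theta \in C$. Now note that for each $n$,

$$\begin{aligned}
r(\lambda_n, \delta) - r(\lambda_n, \delta_n) &\ge r(\lambda_n, \delta) - r(\lambda_n, \delta') \\
&\ge \int_C [R(\theta, \delta) - R(\theta, \delta')]\, d\lambda_n(\theta) \ge \epsilon \lambda_n(C) \ge \epsilon c,
\end{aligned}$$

where $c$ is guaranteed in the second condition. This contradicts (3.41). □
We can use Theorem 3.40 together with the following lemma to prove
that some common estimators are admissible.

$^6$ This theorem is used in Example 3.43 and in the proof of Theorem 3.44.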

Lemma 3.42.$^7$ Suppose that $\Theta = (\Theta_1, \Theta_2)$. Also, suppose that, for each possible value $\theta_{2,0}$ of $\Theta_2$, $\delta$ is admissible when the parameter space is $\Omega_0 = \{\theta = (\theta_1, \theta_{2,0}) \in \Omega\}$. Then $\delta$ is admissible.

Proof. Suppose that $\delta$ were inadmissible. Then there exists $\delta^*$ such that $R(\theta, \delta^*) \leq R(\theta, \delta)$ for all $\theta$ and $R(\theta_0, \delta^*) < R(\theta_0, \delta)$ for some $\theta_0 \in \Omega$. Let $\theta_0 = (\theta_{1,0}, \theta_{2,0})$, and let $\Omega_0 = \{\theta = (\theta_1, \theta_{2,0}) \in \Omega\}$. We now have a contradiction to $\delta$'s being admissible when the parameter space is $\Omega_0$. □

Example 3.43. Suppose that $P_\theta$ says that $X$ has $N(\mu, \sigma^2)$ distribution, where $\theta = (\mu, \sigma)$, and let the loss be $L(\theta, a) = (\mu - a)^2$. We now prove that $\delta(x) = x$ is admissible.

First, we will show that $\delta$ is admissible for the parameter space $\Omega_0 = \{(\mu, \sigma_0) : \mu \in \mathbb{R}\}$. Let $\lambda_n$ be the measure on $\Omega_0$ having density $\sqrt{n}$ times the $N(0, \sigma_0^2 n)$ density. The generalized Bayes rule with respect to $\lambda_n$ is $\delta_n(x) = nx/(n+1)$. The integral of the risk function of $\delta_n$ with respect to $\lambda_n$ is $r_n = n^{3/2}\sigma_0^2/(n+1)$. The integral of the risk function of $\delta$ with respect to $\lambda_n$ is $\sqrt{n}\,\sigma_0^2$. Note that

$$\sqrt{n}\,\sigma_0^2 - \frac{n^{3/2}\sigma_0^2}{n+1} = \frac{\sqrt{n}\,\sigma_0^2}{n+1},$$

which goes to 0 as $n \to \infty$. If $C$ is an open subset of $\Omega_0$, then $\lambda_n(C) \geq \lambda_1(C) > 0$. Since $L(\theta, a)$ is also strictly convex in $a$ for all $\theta$, the conditions of Theorem 3.40 are met for the parameter space $\Omega_0$, so $\delta$ is admissible with this parameter space. By Lemma 3.42, it is admissible for the entire parameter space.
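
The limit in (3.41) for this example can be checked numerically. The following is a minimal sketch, assuming $\sigma_0^2 = 1$ by default; the function name `blyth_gap` is ours and does not appear in the text.

```python
import math

def blyth_gap(n, sigma0_sq=1.0):
    """r(lambda_n, delta) - r(lambda_n, delta_n) for Example 3.43."""
    # Integrated risk of delta(x) = x with respect to lambda_n.
    risk_delta = math.sqrt(n) * sigma0_sq
    # Integrated risk of the generalized Bayes rule delta_n(x) = n x / (n + 1).
    risk_bayes = n ** 1.5 * sigma0_sq / (n + 1)
    return risk_delta - risk_bayes

# The gap equals sqrt(n) * sigma0^2 / (n + 1) and shrinks toward 0 as n grows.
gaps = [blyth_gap(n) for n in (1, 10, 100, 10_000, 1_000_000)]
print(gaps)
```

The gap decreases only at rate $n^{-1/2}$, which is why the $\sqrt{n}$ scaling of the prior density is affordable here.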

7 This lemma is used in Example 3.43. 






The following theorem is a simplified version of a theorem of Brown and 
Hwang (1982). It will allow us to extend Example 3.43 to two dimensions. 

Theorem 3.44. Suppose that $X$ has a $k$-dimensional exponential family distribution given $\Theta = \theta$ with density $f_{X|\Theta}(x|\theta) = c(\theta)\exp(\theta^\top x)$ with respect to a measure $\nu$. Let the natural parameter space $\Omega$ be a rectangular region $I_1 \times \cdots \times I_k$. Let $I_i = (a_{1,i}, a_{2,i})$ with $a_{j,i}$ possibly infinite. Suppose that $\lim_{\theta_i \to a_{j,i}} f_{X|\Theta}(x|\theta) = 0$ for all $i, j, x$. Suppose that there exist a set $S \subseteq \Omega$ with positive Lebesgue measure and a sequence of almost everywhere differentiable functions $\{h_n\}_{n=1}^\infty$ such that

• $h_n : \Omega \to [0, 1]$,

• $h_n(\theta) = 1$ if $\theta \in S$,

• $\lim_{n\to\infty} \|\nabla\sqrt{h_n(\theta)}\| = 0$ for all $\theta$ (where $\nabla$ denotes the gradient),

• $\int_\Omega \sup_n \|\nabla\sqrt{h_n(\theta)}\|^2\, d\theta < \infty$.

Then $\delta(x) = x$ is admissible as an estimator of $g(\theta) = E_\theta(X)$ with loss $L(\theta, a) = \sum_{i=1}^k (g_i(\theta) - a_i)^2$.

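Proof. Let $\lambda_n$ be the measure on $(\Omega, \tau)$ with Radon-Nikodym derivative $h_n(\theta)$ with respect to Lebesgue measure ($\lambda$). Then $d\lambda_n/d\lambda(\theta) = 1$ for all $\theta \in S$ with $\lambda(S) > 0$ and the loss is strictly convex in $a$, so the first of the two alternative conditions of Theorem 3.40 is met. We need to find the generalized Bayes rule $\delta_n$ with respect to $\lambda_n$ and show that $\lim_{n\to\infty} r(\lambda_n, \delta) - r(\lambda_n, \delta_n) = 0$. Since $g(\theta) = -\nabla \log c(\theta)$ by Proposition 2.70, the generalized Bayes rule with respect to $\lambda_n$ will be

$$\begin{aligned}
\delta_n(x) &= \frac{\int_\Omega (-\nabla \log c(\theta))\, c(\theta) \exp(\theta^\top x) h_n(\theta)\, d\theta}{\int_\Omega c(\theta) \exp(\theta^\top x) h_n(\theta)\, d\theta} \\
&= \frac{\int_\Omega (-\nabla c(\theta)) \exp(\theta^\top x) h_n(\theta)\, d\theta}{\int_\Omega c(\theta) \exp(\theta^\top x) h_n(\theta)\, d\theta} \\
&= \frac{\int_\Omega c(\theta) \exp(\theta^\top x) [x h_n(\theta) + \nabla h_n(\theta)]\, d\theta}{\int_\Omega c(\theta) \exp(\theta^\top x) h_n(\theta)\, d\theta} \\
&= x + \frac{\int_\Omega (\nabla h_n(\theta)) f_{X|\Theta}(x|\theta)\, d\theta}{\int_\Omega f_{X|\Theta}(x|\theta) h_n(\theta)\, d\theta},
\end{aligned} \tag{3.45}$$

where the third equality follows by doing integration by parts with respect to $\theta_i$ for the $i$th coordinate of the integral (with $u = \exp(\theta^\top x) h_n(\theta)$ and $dv = -\partial c(\theta)/\partial\theta_i$) and using $\lim_{\theta_i \to a_{j,i}} f_{X|\Theta}(x|\theta) = 0$ to drop the integrated term $uv = -h_n(\theta) f_{X|\Theta}(x|\theta)$. Now write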

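$$\begin{aligned}
r(\lambda_n, \delta) - r(\lambda_n, \delta_n)
&= \int_\Omega \int_{\mathcal{X}} \left( \|\delta(x) - g(\theta)\|^2 - \|\delta_n(x) - g(\theta)\|^2 \right) f_{X|\Theta}(x|\theta) h_n(\theta)\, d\nu(x)\, d\lambda(\theta) \\
&= \int_{\mathcal{X}} \int_\Omega (x - \delta_n(x))^\top (x + \delta_n(x) - 2g(\theta)) f_{X|\Theta}(x|\theta) h_n(\theta)\, d\lambda(\theta)\, d\nu(x).
\end{aligned}$$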

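According to (3.45), we have that

$$\int_\Omega g(\theta) f_{X|\Theta}(x|\theta) h_n(\theta)\, d\lambda(\theta) = \delta_n(x) \int_\Omega f_{X|\Theta}(x|\theta) h_n(\theta)\, d\theta.$$

For convenience, define

$$H_n(x) = \int_\Omega f_{X|\Theta}(x|\theta) h_n(\theta)\, d\theta, \quad \text{and} \quad J_n(x) = \int_\Omega (\nabla h_n(\theta)) f_{X|\Theta}(x|\theta)\, d\theta.$$

Then $x - \delta_n(x) = -J_n(x)/H_n(x)$ and

$$r(\lambda_n, \delta) - r(\lambda_n, \delta_n) = \int_{\mathcal{X}} \frac{\|J_n(x)\|^2}{H_n(x)}\, d\nu(x).$$

Use the fact that $\nabla h_n(\theta) = 2\sqrt{h_n(\theta)}\, \nabla\sqrt{h_n(\theta)}$ and the Cauchy-Schwarz inequality B.19 to conclude that

$$\|J_n(x)\|^2 \leq 4 H_n(x) \int_\Omega \|\nabla\sqrt{h_n(\theta)}\|^2 f_{X|\Theta}(x|\theta)\, d\theta.$$

This means that

$$\begin{aligned}
r(\lambda_n, \delta) - r(\lambda_n, \delta_n) &\leq 4 \int_{\mathcal{X}} \int_\Omega \|\nabla\sqrt{h_n(\theta)}\|^2 f_{X|\Theta}(x|\theta)\, d\theta\, d\nu(x) \\
&= 4 \int_\Omega \|\nabla\sqrt{h_n(\theta)}\|^2\, d\theta,
\end{aligned}$$

which goes to 0 as $n \to \infty$ because of the last two conditions in the theorem and the dominated convergence theorem A.57. □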

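Example 3.46. Suppose that $X \sim N_2(\mu, \sigma)$ given $\Theta = (\mu, \sigma)$, where $\sigma$ is a $2 \times 2$ positive definite diagonal matrix and $\mu$ is two-dimensional.$^8$ Let $\aleph = \mathbb{R}^2$, let $L(\theta, a) = (a_1 - \mu_1)^2 + (a_2 - \mu_2)^2$, and let $\delta(x) = x$. For each value

$$\sigma_0 = \begin{pmatrix} \sigma_{0,1} & 0 \\ 0 & \sigma_{0,2} \end{pmatrix},$$

let $\Omega_0 = \{\theta = (\mu, \sigma_0) : \mu \in \mathbb{R}^2\}$ be a subparameter space. The natural parameter of this exponential family is $\psi = (\mu_1/\sigma_{0,1}, \mu_2/\sigma_{0,2})$, where $\sigma_{0,1}$ and $\sigma_{0,2}$ are the diagonal elements of $\sigma_0$. Consider the following sequence of functions:

$$h_n(\psi) = \begin{cases} 1 & \text{if } \|\psi\| \leq 1, \\ \left(1 - \dfrac{\log\|\psi\|}{\log n}\right)^2 & \text{if } 1 < \|\psi\| \leq n, \\ 0 & \text{if } \|\psi\| > n. \end{cases}$$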

8 This example was compiled from material in Brown and Hwang (1982) and 
Section 8.9 of Berger (1985). 






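Here $S = \{\psi : \|\psi\| \leq 1\}$ in Theorem 3.44. It is easy to see that

$$\begin{aligned}
\|\nabla\sqrt{h_n(\psi)}\|^2 &= (\|\psi\| \log(n))^{-2} I_{(1,n]}(\|\psi\|) \\
&\leq (\|\psi\| \log(\max\{\|\psi\|, 2\}))^{-2} I_{[1,\infty)}(\|\psi\|).
\end{aligned}$$

It follows that $\lim_{n\to\infty} \|\nabla\sqrt{h_n(\psi)}\| = 0$ for all $\psi$. To verify the last condition of Theorem 3.44, we need only show that

$$\int (\|\psi\| \log(\max\{\|\psi\|, 2\}))^{-2} I_{[1,\infty)}(\|\psi\|)\, d\psi < \infty.$$

By transforming to polar coordinates, we see that this integral is a finite constant plus a constant times

$$\int_2^\infty \frac{dr}{r \log^2(r)} = \frac{1}{\log(2)},$$

which is finite.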

The following result allows us to translate admissibility in one decision 
problem to admissibility in a different decision problem if the loss functions 
are related. 

Proposition 3.47.$^9$

• Let $\Omega$ be an open subset of $\mathbb{R}^k$. If $c(\theta) > 0$ for all $\theta \in \Omega$, then $\delta$ is admissible with loss $L(\theta, a)$ if and only if $\delta$ is admissible with loss $c(\theta)L(\theta, a)$.

• Let $\Omega \subseteq \mathbb{R}$ be an interval. Suppose that $c(\theta) \geq 0$ for all $\theta$ and is strictly positive except for $\theta \in A$, where $A$ consists solely of points in $\Omega$ which are isolated from each other. Suppose also that, with loss function $L(\theta, a)$, the risk function of every decision rule is continuous from the left at every $\theta \in A$. (Alternatively, suppose that all risk functions are continuous from the right at every $\theta \in A$.) Then $\delta$ is admissible with loss $L(\theta, a)$ if and only if $\delta$ is admissible with loss $c(\theta)L(\theta, a)$.

• Let $d(\theta)$ be a real-valued function of $\theta$. Then $\delta$ is admissible with loss $L(\theta, a)$ if and only if it is admissible with loss $L(\theta, a) + d(\theta)$.

The proof of this proposition is simple and is left for the reader.

Example 3.48 (Continuation of Example 3.33; see page 156). Let $c(\theta) = \theta(1 - \theta)$ and $\Omega = (0, 1)$. By Proposition 3.47, $\delta_0(x) = x/n$ is admissible with loss $L(\theta, a) = (\theta - a)^2$. Since $c$ and all risk functions are continuous, even at the endpoints, it is easy to show that $\delta_0$ is also admissible if $\Omega = [0, 1]$.

Proposition 3.49. Suppose that $\delta_0$ is $\lambda$-admissible with loss $L(\theta, a)$. Then $\delta_0$ is $\lambda$-admissible with loss $c(\theta)L(\theta, a)$ if $c(\theta) > 0$ a.e. $[\lambda]$.

9 This proposition is used in Examples 3.48 and 3.59. It is also used to simplify 
the class of loss functions in hypothesis testing in Chapter 4. 




3.2.3 James-Stein Estimators 

We have seen some examples of simple decision rules that are admissible, 
but there is a notorious example of a simple decision rule that is inadmis- 
sible. This example has spawned a great amount of study. It begins with 
Stein (1956) and James and Stein (1960), who showed the following. 

Theorem 3.50. Suppose that the conditional distribution of $X_1, \ldots, X_n$ given $\Theta = (\mu_1, \ldots, \mu_n)$ is that they are independent with $X_i \sim N(\mu_i, 1)$. Let $X = (X_1, \ldots, X_n)$, $\aleph = \Omega = \mathbb{R}^n$, and let the loss be $L(\theta, a) = \sum_{i=1}^n (\mu_i - a_i)^2$. Then, if $n > 2$, $\delta(x) = x$ is inadmissible. In fact, a rule that dominates $\delta$ is

$$\delta_1(x) = \delta(x) \left[ 1 - \frac{n-2}{\sum_{j=1}^n x_j^2} \right].$$

The proof we give here requires a few lemmas. The first is due to Stein 
(1981). 

Lemma 3.51. Let $g : \mathbb{R} \to \mathbb{R}$ be a differentiable function with derivative $g'$. Suppose that $X$ has $N(\mu, 1)$ distribution and $E(|g'(X)|) < \infty$. Then $E(g'(X)) = \mathrm{Cov}(X, g(X))$.

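Proof. Let $\phi(t) = \exp(-t^2/2)/\sqrt{2\pi}$ be the standard normal density function. Use integration by parts to show that

$$\phi(x - \mu) = \int_x^\infty (z - \mu)\phi(z - \mu)\, dz = -\int_{-\infty}^x (z - \mu)\phi(z - \mu)\, dz.$$

We will use these facts in what follows.

$$\begin{aligned}
E(g'(X)) &= \int_{-\infty}^\infty g'(x)\phi(x - \mu)\, dx \\
&= \int_0^\infty g'(x)\phi(x - \mu)\, dx + \int_{-\infty}^0 g'(x)\phi(x - \mu)\, dx \\
&= \int_0^\infty g'(x) \int_x^\infty (z - \mu)\phi(z - \mu)\, dz\, dx - \int_{-\infty}^0 g'(x) \int_{-\infty}^x (z - \mu)\phi(z - \mu)\, dz\, dx \\
&= \int_0^\infty (z - \mu)\phi(z - \mu) \int_0^z g'(x)\, dx\, dz - \int_{-\infty}^0 (z - \mu)\phi(z - \mu) \int_z^0 g'(x)\, dx\, dz \\
&= \int_0^\infty (z - \mu)\phi(z - \mu)[g(z) - g(0)]\, dz - \int_{-\infty}^0 (z - \mu)\phi(z - \mu)[g(0) - g(z)]\, dz \\
&= \int_{-\infty}^\infty [g(z) - g(0)](z - \mu)\phi(z - \mu)\, dz \\
&= \int_{-\infty}^\infty g(z)(z - \mu)\phi(z - \mu)\, dz = \mathrm{Cov}(X, g(X)). \quad □
\end{aligned}$$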
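
The identity of Lemma 3.51 is easy to check numerically. The following is a minimal sketch using trapezoidal quadrature; the choices $g = \tanh$ and $\mu = 0.7$, and the helper name `normal_expectation`, are illustrative assumptions of ours.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def normal_expectation(func, mu, half_width=10.0, steps=20_000):
    """Trapezoidal approximation of E[func(X)] for X ~ N(mu, 1)."""
    lo = mu - half_width
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * phi(x - mu)
    return total * h

mu = 0.7
g = math.tanh
g_prime = lambda x: 1.0 / math.cosh(x) ** 2

lhs = normal_expectation(g_prime, mu)                    # E[g'(X)]
rhs = normal_expectation(lambda x: (x - mu) * g(x), mu)  # Cov(X, g(X)) = E[(X - mu) g(X)]
print(lhs, rhs)
```

The covariance is computed as $E[(X - \mu)g(X)]$, which is valid because $E(X - \mu) = 0$.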






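Lemma 3.52. Let $g : \mathbb{R}^n \to \mathbb{R}^n$ be a vector of differentiable functions, $g = (g_1, \ldots, g_n)$. Let $X$ have $N_n(\theta, I)$ distribution. For each $i$, define $h_i(y)$ to be the expected value of $g_i(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n)$ and

$$h_i'(y) = E\left[ \frac{\partial}{\partial y} g_i(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n) \right]. \tag{3.53}$$

Suppose that, for all $i$, $E(|h_i'(X_i)|) < \infty$. Then

$$E\|X + g(X) - \theta\|^2 = n + E\left[ \|g(X)\|^2 + 2\sum_{i=1}^n \left. \frac{\partial}{\partial x_i} g_i(x) \right|_{x=X} \right].$$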

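Proof. Write

$$\begin{aligned}
E\|X + g(X) - \theta\|^2 &= E\|X - \theta\|^2 + E\|g(X)\|^2 + 2E[(X - \theta)^\top g(X)] \\
&= n + E\|g(X)\|^2 + 2E\sum_{i=1}^n [(X_i - \theta_i) g_i(X)].
\end{aligned}$$

All we need to prove is that for each $i$,

$$E[(X_i - \theta_i) g_i(X)] = E\left[ \left. \frac{\partial}{\partial x_i} g_i(x) \right|_{x=X} \right].$$

By integrating out the $X_j$ for $j \neq i$, the left-hand side can be written as

$$\begin{aligned}
E[(X_i - \theta_i) g_i(X)] &= E[(X_i - \theta_i) h_i(X_i)] \\
&= \mathrm{Cov}(X_i, h_i(X_i)) = E(h_i'(X_i)) = E\left[ \left. \frac{\partial}{\partial x_i} g_i(x) \right|_{x=X} \right],
\end{aligned}$$

where the first equality follows from the definition of $h_i$, the third follows from Lemma 3.51, and the fourth follows from (3.53) and then integrating out $X_i$. □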

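Proof of Theorem 3.50. Now, let $g(x) = -x(n-2)/\sum_{j=1}^n x_j^2$, and use the notation of Lemma 3.52. This makes $g_i(x) = -(n-2)x_i/\sum_{j=1}^n x_j^2$. For each $x \neq 0$, the second partial derivative of $g_i$ with respect to the $i$th coordinate of $x$ is uniformly bounded in a neighborhood of $x_i$ for each set of values of $x_j$ for $j \neq i$. A simple application of Taylor's theorem C.1 with remainder shows that $h_i$ can be differentiated under the expectation. We can write

$$E_\theta(|h_i'(X_i)|) \leq (n-2) \int \cdots \int \left| \frac{\sum_{j=1}^n x_j^2 - 2x_i^2}{\left(\sum_{j=1}^n x_j^2\right)^2} \right| f_{X|\Theta}(x|\theta)\, dx.$$

The integral can be bounded by three times the expected value of one over a $\chi^2_n$ random variable. The expected value of one over a $\chi^2_n$ random variable is $1/(n-2)$ if $n > 2$, hence $E_\theta(|h_i'(X_i)|) < \infty$. Lemma 3.52 says that the risk function for $\delta_1$ is

$$n + E_\theta\left[ \|g(X)\|^2 + 2\sum_{i=1}^n \left. \frac{\partial}{\partial x_i} g_i(x) \right|_{x=X} \right].$$

We can write

$$\begin{aligned}
\|g(x)\|^2 &= \frac{(n-2)^2}{\sum_{j=1}^n x_j^2}, \\
2\sum_{i=1}^n \frac{\partial}{\partial x_i} g_i(x) &= -2(n-2) \sum_{i=1}^n \frac{\sum_{j=1}^n x_j^2 - 2x_i^2}{\left(\sum_{j=1}^n x_j^2\right)^2} = -\frac{2(n-2)^2}{\sum_{j=1}^n x_j^2}, \\
\|g(x)\|^2 + 2\sum_{i=1}^n \frac{\partial}{\partial x_i} g_i(x) &= -\frac{(n-2)^2}{\sum_{j=1}^n x_j^2} < 0,
\end{aligned} \tag{3.54}$$

for all $x$. It follows that the risk function is less than $n$ for all $\theta$. □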
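
The domination asserted by Theorem 3.50 can be illustrated by simulation. The following is a minimal Monte Carlo sketch; the function names, the seed, the number of replications, and the two values of $\theta$ are illustrative assumptions of ours, not part of the proof.

```python
import random

def james_stein(x):
    """delta_1(x) = x * (1 - (n - 2) / sum_j x_j^2) from Theorem 3.50."""
    n = len(x)
    s = sum(v * v for v in x)
    shrink = 1.0 - (n - 2) / s
    return [shrink * v for v in x]

def mc_risk(rule, theta, reps=20_000, seed=0):
    """Monte Carlo estimate of E||rule(X) - theta||^2 for X ~ N_n(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        est = rule(x)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / reps

n = 6
risk_js_origin = mc_risk(james_stein, [0.0] * n)  # theory: n - (n - 2) = 2 at theta = 0
risk_js_far = mc_risk(james_stein, [2.0] * n)     # still below n, but closer to it
risk_mle = mc_risk(lambda x: x, [2.0] * n)        # theory: exactly n
print(risk_js_origin, risk_js_far, risk_mle)
```

Because the same seed is used for both rules at $\theta = (2, \ldots, 2)$, the two estimates share their random draws, which makes the comparison between them less noisy than either estimate alone.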
From (3.54), in order to calculate the risk for $\delta_1$, we need the mean of $1/\sum_{j=1}^n X_j^2$. Note that $Z = \sum_{j=1}^n X_j^2$ has noncentral $\chi^2$ distribution, $NC\chi^2_n(\lambda)$ with $\lambda = \sum_{i=1}^n \mu_i^2$. From the form of the $NC\chi^2$ density, it is clear that $Z$ has the same distribution as $Y$, where $Y \sim \chi^2_{n+2k}$ given $K = k$ and $K \sim Poi(\lambda)$. The mean of $1/Z$ is

$$E\left(\frac{1}{Z}\right) = E\left[ E\left( \left. \frac{1}{Y} \right| K \right) \right] = E\left( \frac{1}{n + 2K - 2} \right).$$

Notice that when $\lambda = 0$, $K = 0$, a.s. and $R(0, \delta_1) = 2$. This is where the risk function is smallest. A plot of $R(\theta, \delta_1)$ as a function of $\lambda$ for $n = 6$ is given in Figure 3.56. There is no reason why the smallest value of the risk function must occur when $\theta = 0$. We could subtract a vector $\theta_0$ from $X$ and then add $\theta_0$ back on to $\delta_1$ to get an estimator that has the minimum of its risk function at $\theta_0$. This would give the decision rule

$$\delta_2(X) = \theta_0 + \delta_1(X - \theta_0). \tag{3.55}$$

It may be that we cannot decide which vector $\theta_0$ to subtract. It is possible to choose based on the data. If $n \geq 4$, then we could use the decision rule

$$\delta_3(X) = \bar{X}\mathbf{1} + \left[ 1 - \frac{n-3}{\sum_{i=1}^n (X_i - \bar{X})^2} \right] (X - \bar{X}\mathbf{1}),$$

where $\mathbf{1}$ denotes a vector whose coordinates are all 1. (See Problem 20 on page 211.)



[Figure omitted: the risk $R(\theta, \delta_1)$ plotted against $\lambda$ on the horizontal axis, for $\lambda$ from 0 to 6.]

Figure 3.56. Risk Function of James-Stein Estimator for n = 6

There is a way to derive the James-Stein estimator from an empirical Bayes argument.$^{10}$ This was done by Efron and Morris (1975). Suppose that $\Theta \sim N_n(\theta_0, \tau^2 I)$. The Bayes estimate for $\Theta$ is

$$\theta_0 + (X - \theta_0)\frac{\tau^2}{\tau^2 + 1}.$$

The empirical Bayes approach tries to estimate $\tau$ from the marginal distribution of $X$. The marginal distribution of $X$ is $N_n(\theta_0, (1 + \tau^2)I)$. So, we could estimate $1 + \tau^2$ by $\sum_{i=1}^n (X_i - \theta_{0i})^2/c$ for some $c$. An estimate of $\tau^2/(\tau^2 + 1)$ is

$$1 - \frac{c}{\sum_{i=1}^n (X_i - \theta_{0i})^2}.$$

The empirical Bayes estimator is then $\delta_2(X)$ if $c = n - 2$. If we take the empirical Bayes approach one step further and also try to estimate $\theta_0$, we could use $\bar{X}\mathbf{1}$ as an estimate, and the estimate of $1 + \tau^2$ would be $\sum_{i=1}^n (X_i - \bar{X})^2/c$. With $c = n - 3$, we get $\delta_3(X)$.

Another way to arrive at estimators like these is through hierarchical models (to be discussed in more detail in Chapter 8). For example, $\Theta_1, \ldots, \Theta_n$ could be modeled as conditionally IID $N(\mu, \tau^2)$ given $M = \mu$ and $T = \tau$. Then $M$ and $T$ could have some distribution, rather than merely being estimated as in the empirical Bayes approach. Strawderman (1971) finds a class of Bayes rules that dominate $\delta_0$ when $p \geq 5$ and are admissible by Theorems 3.27 and 3.32.



10 See Section 8.4 for more detail on empirical Bayes analysis.






The estimator $\delta_1(X)$ is actually inadmissible, as can be shown in Problem 22 on page 211. Brown (1971) considers the problem of finding necessary and sufficient conditions for an estimator to be admissible in this setting.

3.2.4 Minimax Rules 

There are usually lots of admissible rules. Unless one is willing to choose 
one by choosing a Bayes rule with respect to some prior distribution, then 
one needs some other criterion by which to choose a rule. One such criterion 
is the minimax principle. 

The Minimax Principle: In comparing rules, the rule $\delta$ with the smallest value of $\sup_\theta R(\theta, \delta)$ is best.

The minimax principle says to prepare for the worst possible value $\theta$ of $\Theta$. When playing a game against an opponent who is trying to make things bad for you, there may be good reason to prepare for the worst. When it makes sense to consider how likely the various alternative values of $\Theta$ are, the worst value may turn out not to be of such concern.

Definition 3.57. A rule $\delta_0$ is called minimax if, for all $\delta$, $\sup_{\theta \in \Omega} R(\theta, \delta_0) \leq \sup_{\theta \in \Omega} R(\theta, \delta)$; alternatively, $\sup_{\theta \in \Omega} R(\theta, \delta_0) = \inf_\delta \sup_{\theta \in \Omega} R(\theta, \delta)$.

Proposition 3.58. If $\delta$ has constant risk and it is admissible, then it is minimax.

Example 3.59. We saw earlier that when $X \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$, $\delta(x) = x$ is admissible when the loss function is $L(\theta, a) = (\mu - a)^2$. By Proposition 3.47, it is also admissible with loss $L'(\theta, a) = (\mu - a)^2/\sigma^2$. The risk function for this new loss is constant, $R(\theta, \delta) = 1/n$. Hence $\delta$ is minimax with loss $L'$. Every other decision rule will have to have a risk function that approaches or surpasses $1/n$ for some $\theta$ values.

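Theorem 3.60. Let $\{\lambda_n\}_{n=1}^\infty$ be a sequence of probability measures on the parameter space $(\Omega, \tau)$ with $\delta_n$ being the Bayes rule with respect to $\lambda_n$. Suppose that $\lim_{n\to\infty} r(\lambda_n, \delta_n) = c < \infty$. If there is $\delta_0$ such that $R(\theta, \delta_0) \leq c$ for all $\theta$, then $\delta_0$ is minimax.

Proof. Assume $\delta_0$ is not minimax. Then there is $\delta'$ and $\epsilon > 0$ such that

$$R(\theta, \delta') \leq \sup_{\phi \in \Omega} R(\phi, \delta_0) - \epsilon \leq c - \epsilon,$$

for all $\theta$. Choose $n_0$ so that $r(\lambda_n, \delta_n) \geq c - \epsilon/2$, for all $n \geq n_0$. Then, for $n \geq n_0$,

$$r(\lambda_n, \delta') = \int_\Omega R(\theta, \delta')\, d\lambda_n(\theta) \leq c - \epsilon < c - \frac{\epsilon}{2} \leq r(\lambda_n, \delta_n),$$

which contradicts the fact that $\delta_n$ is the Bayes rule with respect to $\lambda_n$. □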






Example 3.61. Suppose that $P_\theta$ says that $X_1, \ldots, X_m$ are independent with $X_i \sim N(\theta_i, 1)$, where $\theta = (\theta_1, \ldots, \theta_m)$. Let $\delta_0(X) = X$, $\aleph = \mathbb{R}^m$, and $L(\theta, a) = \sum_{i=1}^m (\theta_i - a_i)^2$. Let $\lambda_n$ be the probability measure that says $\Theta$ has distribution $N_m(0, nI)$. The Bayes rules are $\delta_n(X) = nX/(n+1)$, and the Bayes risks are $r(\lambda_n, \delta_n) = mn/(n+1)$, which go to $m$ as $n \to \infty$. Also, $R(\theta, \delta_0) = m$, so $\delta_0$ is minimax. We see that minimax rules need not be admissible.

Example 3.62. Suppose that $P_\theta$ says that $X \sim Bin(n, \theta)$ and that $L(\theta, a) = (\theta - a)^2/[\theta(1 - \theta)]$, where $\Omega = (0, 1)$ and $\aleph = [0, 1]$. A rule with constant risk is $\delta_0(X) = X/n$. The risk is $R(\theta, \delta_0) = 1/n$. We saw earlier that $\delta_0$ is admissible, so it is minimax.

Now, suppose that we use the loss $L'(\theta, a) = (\theta - a)^2$. We will see that no analog to Proposition 3.47 operates here. If the prior for $\Theta$ is $Beta(\alpha, \beta)$, then the Bayes rule is $\delta(x) = (\alpha + x)/(\alpha + \beta + n)$. The risk function for this rule is

$$R(\theta, \delta) = \frac{1}{(\alpha + \beta + n)^2} \left\{ \theta^2 \left[ (\alpha + \beta)^2 - n \right] + \theta \left[ n - 2\alpha(\alpha + \beta) \right] + \alpha^2 \right\},$$

which is constant if $\alpha = \beta = \sqrt{n}/2$. The constant risk is $1/(4 + 8\sqrt{n} + 4n)$. So $\delta$ is minimax. The rule can be expressed as $\delta(x) = (x + \sqrt{n}/2)/(n + \sqrt{n})$. Notice that this is like changing your prior distribution as the sample size changes.
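
The constant risk of this rule can be confirmed by summing over the binomial distribution exactly. The following is a minimal sketch; the sample size $n = 25$, the grid of $\theta$ values, and the function name `binomial_risk` are illustrative assumptions of ours.

```python
import math

def binomial_risk(rule, n, theta):
    """Exact risk E[(rule(X) - theta)^2] for X ~ Bin(n, theta)."""
    return sum(
        math.comb(n, x) * theta ** x * (1 - theta) ** (n - x) * (rule(x) - theta) ** 2
        for x in range(n + 1)
    )

n = 25
root_n = math.sqrt(n)
minimax_rule = lambda x: (x + root_n / 2) / (n + root_n)  # Bayes rule for Beta(sqrt(n)/2, sqrt(n)/2)
mle = lambda x: x / n                                     # delta_0(x) = x / n

thetas = [0.05, 0.25, 0.5, 0.75, 0.95]
minimax_risks = [binomial_risk(minimax_rule, n, t) for t in thetas]
mle_risks = [binomial_risk(mle, n, t) for t in thetas]
constant = 1 / (4 + 8 * root_n + 4 * n)
print(constant, minimax_risks, mle_risks)
```

The risk of $x/n$ under squared-error loss is $\theta(1-\theta)/n$, which exceeds the constant near $\theta = 1/2$ and falls below it near the endpoints, so neither rule dominates the other.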

Since minimax rules are designed to prepare for the worst, we should see if there are prior distributions that make the worst $\theta$ just likely enough so that the corresponding Bayes rules are minimax.$^{11}$

Definition 3.63. A prior distribution $\lambda_0$ for $\Theta$ is least favorable if

$$\inf_\delta r(\lambda_0, \delta) = \sup_\lambda \inf_\delta r(\lambda, \delta).$$

Such a $\lambda_0$ is sometimes called a maximin strategy for nature.

Let $\lambda_0$ and $\delta_0$ be a fixed probability and decision rule, respectively. It is true that $\inf_\delta r(\lambda_0, \delta) \leq \sup_\lambda r(\lambda, \delta_0)$. So, it follows that

$$\underline{V} = \sup_\lambda \inf_\delta r(\lambda, \delta) \leq \inf_\delta \sup_\lambda r(\lambda, \delta) = \inf_\delta \sup_\theta R(\theta, \delta) = \overline{V}.$$

Definition 3.64. The number $\underline{V}$ above is called the maximin value of the decision problem, and the number $\overline{V}$ is called the minimax value.

Theorem 3.65. If $\delta_0$ is Bayes with respect to $\lambda_0$ and $R(\theta, \delta_0) \leq r(\lambda_0, \delta_0)$ for all $\theta$, then $\delta_0$ is minimax and $\lambda_0$ is least favorable.



11 There is a delicate balance between how likely the bad $\theta$ values are and how likely the rest of the parameter space is. In the second part of Example 3.62, the worst $\theta$ values, in some sense, are those near 1/2 because the data are most variable given $\Theta = 1/2$. The prior $Beta(\sqrt{n}/2, \sqrt{n}/2)$ puts enough mass near 1/2 to force us to take seriously the possibility that the data will be highly variable. However, it still spreads enough mass around the remainder of the parameter space so that we cannot ignore other $\theta$ values. If the prior put probability 1 on $\theta = 1/2$, for example, then the Bayes rule would be $\delta(x) = 1/2$ for all $x$.






Proof. Since $\overline{V} \leq \sup_\theta R(\theta, \delta_0) \leq r(\lambda_0, \delta_0) = \inf_\delta r(\lambda_0, \delta) \leq \underline{V} \leq \overline{V}$, it follows that all of the inequalities are equalities. □

Example 3.66 (Continuation of Example 3.62; see page 168). We saw that the minimax rule with loss $(\theta - a)^2$ was Bayes with respect to the $Beta(\sqrt{n}/2, \sqrt{n}/2)$ prior, $\lambda_0$. Since the risk function is constant, $R(\theta, \delta) = r(\lambda_0, \delta)$ for all $\theta$. It follows that $\lambda_0$ is least favorable. The reason it is least favorable is that it puts a lot of mass on the $\theta$ values (near 1/2) that have high variance for $X$. On the other hand, it does not put so much mass there that the estimator is drawn too close to 1/2.
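
One way to see the least favorable property concretely is to compare Bayes risks across a family of priors. The following is a minimal sketch restricted to symmetric $Beta(a, a)$ priors (an assumption of ours; the definition ranges over all priors), with $n = 16$ and a grid of $a$ values chosen for illustration.

```python
import math

def bayes_risk_symmetric_beta(a, n):
    """Bayes risk of the Bayes rule (a + x) / (2a + n) under a Beta(a, a) prior."""
    m1 = 0.5                          # E[Theta] under Beta(a, a)
    m2 = (a + 1) / (2 * (2 * a + 1))  # E[Theta^2] under Beta(a, a)
    # Integrate the quadratic risk function of Example 3.62 with alpha = beta = a.
    numerator = m2 * (4 * a * a - n) + m1 * (n - 4 * a * a) + a * a
    return numerator / (2 * a + n) ** 2

n = 16
a_star = math.sqrt(n) / 2                # the prior of Example 3.66
grid = [0.1 * j for j in range(1, 101)]  # a = 0.1, 0.2, ..., 10.0
risks = [bayes_risk_symmetric_beta(a, n) for a in grid]
best_a = grid[risks.index(max(risks))]
print(best_a, max(risks))
```

Within this family the Bayes risk is largest at $a = \sqrt{n}/2$, where it equals the minimax value $1/(4 + 8\sqrt{n} + 4n)$.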

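Definition 3.67. A rule $\delta_0$ is extended Bayes if for every $\epsilon > 0$ there exists a prior $\lambda_\epsilon$ such that $r(\lambda_\epsilon, \delta_0) \leq \epsilon + \inf_\delta r(\lambda_\epsilon, \delta)$.

Example 3.68. Suppose that $P_\theta$ says that $X \sim N(\theta, 1)$. Let $L(\theta, a) = (\theta - a)^2$, and let $\lambda_\epsilon$ be the $N(0, (1 - \epsilon)/\epsilon)$ prior for $\Theta$. Let $\delta_0(x) = x$. The Bayes rule with respect to $\lambda_\epsilon$ is

$$\delta_\epsilon(x) = x\, \frac{(1 - \epsilon)/\epsilon}{1 + (1 - \epsilon)/\epsilon} = (1 - \epsilon)x.$$

The Bayes risk is $r(\lambda_\epsilon, \delta_\epsilon) = 1 - \epsilon$, while the Bayes risk of $\delta_0$ is $r(\lambda_\epsilon, \delta_0) = 1$, which is no greater than $\epsilon + 1 - \epsilon$. So $\delta_0$ is extended Bayes.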
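
The two Bayes risks in Example 3.68 follow from the risk of a linear rule $ax$, which is $a^2 + (1 - a)^2\theta^2$. The following is a minimal sketch of that calculation; the function names are ours.

```python
def bayes_risk_linear(a, prior_var):
    """Bayes risk of delta(x) = a * x when X | theta ~ N(theta, 1), theta ~ N(0, prior_var)."""
    # R(theta, a x) = a^2 + (1 - a)^2 theta^2; the prior mean of theta^2 is prior_var.
    return a * a + (1 - a) ** 2 * prior_var

def extended_bayes_gap(eps):
    """r(lambda_eps, delta_0) - r(lambda_eps, delta_eps) for Example 3.68."""
    prior_var = (1 - eps) / eps
    bayes = bayes_risk_linear(1 - eps, prior_var)  # Bayes rule (1 - eps) x
    plain = bayes_risk_linear(1.0, prior_var)      # delta_0(x) = x
    return plain - bayes

gaps = {eps: extended_bayes_gap(eps) for eps in (0.5, 0.1, 0.01)}
print(gaps)
```

The gap equals $\epsilon$ exactly, which is the inequality required by Definition 3.67.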

Theorem 3.69. A constant-risk extended Bayes rule is minimax. 

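Proof. Let $\delta_0$ be a constant-risk extended Bayes rule. Let $R(\theta, \delta_0) = c$ for all $\theta$. Suppose that $\delta_0$ is not minimax, but rather that there is a rule $\delta_1$ such that $\sup_\theta R(\theta, \delta_1) = c - \epsilon$, for some $\epsilon > 0$. Let $\lambda_{\epsilon/2}$ be as in Definition 3.67. That is,

$$r(\lambda_{\epsilon/2}, \delta_0) \leq \frac{\epsilon}{2} + \inf_\delta r(\lambda_{\epsilon/2}, \delta).$$

Since $\delta_0$ has constant risk $c$, its Bayes risk is also $c$. So

$$c \leq \frac{\epsilon}{2} + \inf_\delta r(\lambda_{\epsilon/2}, \delta).$$

The Bayes risk of $\delta_1$ can be no greater than $c - \epsilon$, so $\inf_\delta r(\lambda_{\epsilon/2}, \delta) \leq c - \epsilon$. It follows that $c \leq \epsilon/2 + c - \epsilon = c - \epsilon/2$, which is a contradiction. □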

Example 3.70. Suppose that $P_\theta$ says that $X \sim Poi(\theta)$, where $\Omega = (0, \infty)$ and $\aleph = [0, \infty)$. Let the loss function be $L(\theta, a) = (\theta - a)^2/\theta$, and let $\delta_0(x) = x$. The risk function is $R(\theta, \delta_0) = 1$, constant. Let $\lambda_\epsilon$ be the prior that says $\Theta$ has $\Gamma(1, \epsilon/[1 - \epsilon])$ distribution. The posterior distribution is

$$\Theta \mid X = x \sim \Gamma\left( x + 1, \frac{1}{1 - \epsilon} \right).$$

The Bayes rule is $\delta_\epsilon(x) = x(1 - \epsilon)$, with Bayes risk $1 - \epsilon$. Since the Bayes risk of $\delta_0$ is 1, $\delta_0$ is extended Bayes, hence minimax.

There are certain situations in which minimax rules are known to exist. These involve finite parameter spaces. When $\Omega$ is finite, the risk function is just a vector in some Euclidean space. The set of all risk functions of all decision rules is just a convex set of vectors.$^{12}$

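Definition 3.71. Suppose that $\Omega = \{\theta_1, \ldots, \theta_k\}$. Let

$$R = \{z \in \mathbb{R}^k : z_i = R(\theta_i, \delta), \text{ for some decision rule } \delta \text{ and } i = 1, \ldots, k\}.$$

We call $R$ the risk set. The lower boundary of a set $C \subseteq \mathbb{R}^k$ is the set

$$\{z \in \overline{C} : x_i \leq z_i \text{ for all } i \text{ and } x_i < z_i \text{ for some } i \text{ implies } x \notin \overline{C}\}.$$

The lower boundary of the risk set is denoted $\partial_L$. The risk set is closed from below if $\partial_L \subseteq R$.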

Example 3.72. Consider a situation with $\Omega = \{0, 1\}$ and $\aleph = \{1, 2, 3\}$. Let the loss function be

$$\begin{array}{c|ccc}
L(\theta, a) & a = 1 & a = 2 & a = 3 \\ \hline
\theta = 0 & 0 & 1 & 0.5 \\
\theta = 1 & 1 & 0 & 0.2
\end{array}$$

Supposing that no data are available, the class of randomized rules consists of the set of all probability distributions over the action space $\aleph$. The risk function for a randomized rule with probabilities $(p_1, p_2, p_3)$ for the three actions $(1, 2, 3)$ is just the point $(p_2 + 0.5p_3, p_1 + 0.2p_3)$. The set of all such points is the shaded region in Figure 3.73.

We can locate the minimax rule in Figure 3.73 by looking at all orthants of the form $O_s = \{(x_0, x_1) : x_0 \leq s, x_1 \leq s\}$ and finding the one with the smallest $s$ that intersects the risk set. These orthants are shown in Figure 3.73. For all $s < 0.3846$, the orthant $O_s$ fails to intersect the risk set. But $O_{0.3846}$ does intersect the risk set at the point $(0.3846, 0.3846)$. This point corresponds to the randomized rule with probabilities $(0.2308, 0, 0.7692)$ on the three available actions. It is interesting to note that one is required to randomize in order to achieve the minimax rule. This is somewhat disconcerting for the following reason (among others). After performing the randomization, one will then either choose action $a = 1$ or action $a = 3$. In either case, one is no longer using the minimax rule, and the risk point for the chosen decision is either $(0, 1)$ or $(0.5, 0.2)$, but not $(0.3846, 0.3846)$ as hoped.

Two lines are added to Figure 3.73 to show the Bayes rules with respect to two different priors. The line $0.5x_0 + 0.5x_1 = 0.35$ passes through the point $(0.5, 0.2)$ to indicate that the action $a = 3$ is Bayes with respect to the prior that puts equal probability on each parameter value. (The action $a = 3$ is also Bayes with respect to many other priors.) The prior with $\Pr(\Theta = 0) = 0.6154$ is least favorable, and the line $0.6154x_0 + 0.3846x_1 = 0.3846$ passes through all of the points corresponding to Bayes rules with respect to the least favorable distribution. (See Problem 25 on page 211 to see how the minimax principle is actually in conflict with the expected loss principle in this example.)
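
The minimax point of Example 3.72 can be recomputed directly from the loss table. The following is a minimal sketch for problems with two parameter values and no data. It assumes, as holds for such problems by a linear programming argument, that some minimax randomized action mixes at most two actions; the function name `minimax_two_states` is ours.

```python
def minimax_two_states(losses):
    """Minimax randomized action when there are two parameter values and no data.

    losses[a] = (L(theta=0, a), L(theta=1, a)).  Returns (value, probabilities).
    """
    m = len(losses)
    best_value, best_probs = float("inf"), None
    for i in range(m):
        for j in range(i, m):
            (a0, a1), (b0, b1) = losses[i], losses[j]
            candidates = [1.0, 0.0]      # weight on action i: pure i, then pure j
            denom = (a0 - a1) - (b0 - b1)
            if denom != 0:
                w = (b1 - b0) / denom    # weight on i that equalizes the two risks
                if 0.0 < w < 1.0:
                    candidates.append(w)
            for w in candidates:
                r0 = w * a0 + (1 - w) * b0
                r1 = w * a1 + (1 - w) * b1
                if max(r0, r1) < best_value:
                    probs = [0.0] * m
                    probs[i] += w
                    probs[j] += 1 - w
                    best_value, best_probs = max(r0, r1), probs
    return best_value, best_probs

# Loss table of Example 3.72: one (L(0, a), L(1, a)) pair per action a = 1, 2, 3.
value, probs = minimax_two_states([(0.0, 1.0), (1.0, 0.0), (0.5, 0.2)])
print(value, probs)
```

The exact values are $5/13$ for the minimax risk and $(3/13, 0, 10/13)$ for the probabilities, which round to the figures quoted in the example.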



12 The remainder of this section is devoted to proving the minimax theo- 
rem 3.77. The discussion of risk sets is used in the proofs of the minimax the- 
orem 3.77 and the complete class theorem 3.95. It is also used briefly in the 
discussion of simple hypotheses and simple alternatives in Chapter 4. All of this 
material can be skipped without disrupting the flow of the remaining material. 







Figure 3.73. Risk Set for Example 3.72 

Notice that the risk set in Figure 3.73 is convex. This is true in general. 
Lemma 3.74. The risk set is convex. 

Proof. Let $z_i$ be a point in the risk set that corresponds to a decision rule $\delta_i$ for $i = 1, 2$, and let $0 \leq \alpha \leq 1$. Then $\alpha z_1 + (1 - \alpha) z_2$ corresponds to the randomized decision rule $\alpha\delta_1 + (1 - \alpha)\delta_2$. □
There is a common misconception that a minimax rule can be located 
by finding that point in the risk set with all coordinates equal which lies 
closest to the origin. Here is an example in which the unique minimax rule 
corresponds to a point with distinct coordinates. 

Example 3.75. Consider a situation with $\Omega = \{0, 1\}$ and $\aleph = \{1, 2, 3\}$. Let the loss function be

$$\begin{array}{c|ccc}
L(\theta, a) & a = 1 & a = 2 & a = 3 \\ \hline
\theta = 0 & 0 & 0.25 & 1 \\
\theta = 1 & 1 & 0.5 & 0.75
\end{array}$$

The class of randomized rules consists of the set of all probability distributions over the action space $\aleph$. The risk function for a randomized rule with probabilities $(p_1, p_2, p_3)$ for the three actions $(1, 2, 3)$ is just the point $(0.25p_2 + p_3, p_1 + 0.5p_2 + 0.75p_3)$. The risk set is illustrated in Figure 3.76 together with the point corresponding to the unique minimax rule, choose action 2. The point $(0.625, 0.625)$ is the closest point to the origin which has equal coordinates.
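
A brute-force search over randomized actions confirms both claims of Example 3.75. The following is a minimal sketch; the grid resolution of 200 steps and the helper name `risk_point` are illustrative assumptions of ours.

```python
def risk_point(p, losses):
    """Risk point (R(0, p), R(1, p)) of the randomized action with probabilities p."""
    return tuple(sum(pa * la[t] for pa, la in zip(p, losses)) for t in (0, 1))

# Loss table of Example 3.75: one (L(0, a), L(1, a)) pair per action a = 1, 2, 3.
losses = [(0.0, 1.0), (0.25, 0.5), (1.0, 0.75)]

steps = 200
best = None
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p = (i / steps, j / steps, (steps - i - j) / steps)
        worst = max(risk_point(p, losses))   # maximum risk over the two parameter values
        if best is None or worst < best[0]:
            best = (worst, p)

minimax_value, minimax_probs = best
# Mixing actions 2 and 3 equally gives the equal-coordinate point nearest the origin.
equalizer = risk_point((0.0, 0.5, 0.5), losses)
print(minimax_value, minimax_probs, equalizer)
```

The search returns the pure action 2 with risk point $(0.25, 0.5)$ and maximum risk 0.5, which is smaller than the 0.625 attained at the equal-coordinate point.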

The following theorem gives conditions under which minimax rules and 
least favorable distributions exist. 




Figure 3.76. Minimax Rule with Unequal Risks 



Theorem 3.77 (Minimax theorem). Suppose that the loss function is bounded below and $\Omega$ is finite. Then $\sup_\lambda \inf_\delta r(\lambda, \delta) = \inf_\delta \sup_\theta R(\theta, \delta)$, and there exists a least favorable distribution $\lambda_0$. If $R$ is closed from below, then there is a minimax rule that is a Bayes rule with respect to $\lambda_0$.

The proof requires a few lemmas. 

Lemma 3.78. Suppose that $\Omega$ is finite. The loss function is bounded below if and only if the risk set is bounded below.

Proof. Suppose that the loss function is bounded below. That is, there exists $c$ such that $L(\theta, a) \geq c$ for all $\theta$ and $a$. Then $R(\theta, \delta) \geq c$ for all $\theta$ and $\delta$, since the risk function is just the integral of the loss function with respect to a probability measure.

Suppose that the loss function is unbounded below, that is, there exists a sequence $\{(\theta_n, a_n)\}_{n=1}^\infty$ such that $L(\theta_n, a_n) < -n$ for each $n$. Since $\Omega$ is finite, there exists a single $\theta$ and a sequence $\{b_n\}_{n=1}^\infty$ such that $L(\theta, b_n) < -n$ for all $n$. With $\delta_n(x) = b_n$ for all $x$, we have $R(\theta, \delta_n) < -n$, hence the risk set is not bounded below. □

Lemma 3.79. If a set $C \subseteq \mathbb{R}^k$ is bounded below, then its lower boundary is nonempty.

Proof. First, note that the lower boundary of $C$ is the same as the lower boundary of $\overline{C}$, the closure of $C$. Next, let $c_1 = \inf\{z_1 : z \in \overline{C}\}$, and for $i = 2, \ldots, k$, let

$$C_i = \{z \in \overline{C} : z_j = c_j, \text{ for } j = 1, \ldots, i - 1\},$$

$$c_i = \inf\{z_i : z \in C_i\}.$$

Since $\overline{C}$ is closed, the point $c = (c_1, \ldots, c_k) \in \overline{C}$. Suppose that the lower boundary is empty. Then, there is a point $x \in \overline{C}$ such that $x_i \leq c_i$ for all $i$ with at least one inequality strict. This is clearly a contradiction to the way that $c$ was constructed. □

Lemma 3.80. Suppose that the loss function is bounded below. If there is a minimax rule, then there is a point on $\partial_L$ whose maximum coordinate is the same as the minimax risk.

Proof. Let $z$ be the risk function for a minimax rule, and let $s$ be the minimax risk $\max\{z_1, \ldots, z_k\}$. Let $C = R \cap \{x \in \mathbb{R}^k : x_i \leq s, \text{ for all } i\}$. Since the loss function is bounded below, Lemma 3.78 says the risk set is bounded below, so $C$ is bounded below. Lemma 3.79 says that the lower boundary of $C$ is nonempty. Clearly, the lower boundary of $C$ is a subset of the lower boundary of $R$. Since each point in $C$ is the risk function of a minimax rule, the result is proven. □
Proof of minimax theorem 3.77.$^{13}$ Let $\overline{R}$ denote the closure of $R$. For each real $s$, let

$$A_s = \{z \in \mathbb{R}^k : z_i \leq s \text{ for all } i\}.$$

Then $A_s$ is closed and convex for each $s$. Let $s_0 = \inf\{s : A_s \cap \overline{R} \neq \emptyset\}$.

We will prove first that there is a least favorable distribution. It is easy to see that

$$s_0 = \inf_\delta \sup_\theta R(\theta, \delta). \tag{3.81}$$

Note that the interior of $A_{s_0}$ is convex and does not intersect $R$. By the separating hyperplane theorem C.5, there is a vector $v$ and number $c$ such that $v^\top z \geq c$ for all $z \in R$ and $v^\top x \leq c$ for all $x$ in the interior of $A_{s_0}$. It is clear that each coordinate of $v$ is nonnegative since, if $v_i < 0$, a sequence of points $\{x^n\}_{n=1}^\infty$ in the interior of $A_{s_0}$ exists with $\lim_{n\to\infty} x_i^n = -\infty$ and all other $x_j^n = s_0 - \epsilon$ and $\lim_{n\to\infty} v^\top x^n = \infty > c$. So, let $\lambda_0$ be the probability that puts mass $\lambda_{0,i} = v_i/\sum_{j=1}^k v_j$ on $\theta_i$ for $i = 1, \ldots, k$. Since $(s_0, \ldots, s_0)$ is in the closure of the interior of $A_{s_0}$, it follows that $c \geq s_0 \sum_{j=1}^k v_j$. We can now calculate

$$\inf_\delta r(\lambda_0, \delta) = \inf_{z \in R} \lambda_0^\top z \geq \frac{c}{\sum_{j=1}^k v_j} \geq s_0 = \inf_\delta \sup_\theta R(\theta, \delta). \tag{3.82}$$

It follows that $\lambda_0$ is a least favorable distribution.

Next, we prove that if $R$ is closed from below, there is a minimax rule. Let $\{s_n\}_{n=1}^\infty$ be a decreasing sequence converging to $s_0$. Note the following facts:


13 This proof is similar to proofs of Ferguson (1967) and Berger (1985).






• For each $n$, $\overline{R} \cap A_{s_n} \neq \emptyset$ is closed and bounded.

• $\overline{R} \cap A_{s_0} = \bigcap_{n=1}^\infty (\overline{R} \cap A_{s_n})$.

It follows from the Bolzano-Weierstrass theorem C.6 that $\overline{R} \cap A_{s_0}$ is closed and nonempty. It follows from (3.81) that every point in $R \cap A_{s_0}$ is the risk function of a minimax rule. Now, apply Lemma 3.79 with $C = \overline{R} \cap A_{s_0}$ to see that there is a point of $\partial_L$ contained in $\overline{R} \cap A_{s_0}$. Since $\partial_L \subseteq R$, we have a point in $R$ that is the risk function of a minimax rule.

Finally, we prove that a minimax rule $\delta$ whose risk function is on $\partial_L$ is a Bayes rule with respect to $\lambda_0$. Since $R(\theta, \delta) \leq s_0$ for all $\theta$, it follows that $r(\lambda_0, \delta) \leq s_0$. Combining this with (3.82) completes the proof. □

3.2.5 Complete Classes 

Sometimes minimax rules are too hard to find, or they are not good rules. 
It might be worthwhile just to find all admissible rules. Or, one could find 
a set of rules such that every other rule is dominated by one in the set. 

Definition 3.83. A class of decision rules $C$ is complete if, for every $\delta \notin C$, there is $\delta_0 \in C$ such that $\delta_0$ dominates $\delta$. A class $C$ is essentially complete if for every $\delta \notin C$, there is $\delta_0 \in C$ such that $R(\theta, \delta_0) \leq R(\theta, \delta)$ for all $\theta$. A class $C$ is minimal (essentially) complete if no proper subclass is also (essentially) complete.

Lemma 3.84.$^{14}$ If $C$ is a complete class and $A$ is the set of all admissible rules, then $A \subseteq C$.

Proof. If $\delta_0 \notin C$, then there is $\delta \in C$ such that $\delta$ dominates $\delta_0$, hence $\delta_0$ is inadmissible, hence $\delta_0 \notin A$. □

Proposition 3.85. If $C$ is an essentially complete class and there is an admissible $\delta \notin C$, then there is $\delta_0 \in C$ such that $R(\theta, \delta_0) = R(\theta, \delta)$ for all $\theta$.

Lemma 3.86. If a minimal complete class exists, it consists of exactly the admissible rules.

Proof. Let $C$ be a minimal complete class and let $A$ be the set of admissible rules. By Lemma 3.84, $A \subseteq C$. We need to prove $C \subseteq A$. Assume the contrary. That is, assume that there is $\delta_0 \in C$ but $\delta_0 \notin A$. Then there is $\delta_1$ that dominates $\delta_0$. Either $\delta_1 \in C$ or not. If $\delta_1 \in C$, call $\delta_2 = \delta_1$. If $\delta_1 \notin C$, there exists $\delta_2 \in C$ such that $\delta_2$ dominates $\delta_1$. In either case, $\delta_2 \in C$ dominates $\delta_0$. If $\delta_0$ dominates some other rule $\delta$, then $\delta_2$ also dominates $\delta$, so $C \setminus \{\delta_0\}$ is a complete class. But this contradicts the fact that $C$ was minimal complete. □
There is one famous case in which we can find a minimal complete class. 



14 This lemma is used in the proof of Lemma 3.86. 

15 This theorem originated with Neyman and Pearson (1933). 






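Theorem 3.87 (Neyman-Pearson fundamental lemma).$^{15}$ Let $\Omega$ and $\aleph$ both be $\{0, 1\}$, $L(0, 0) = L(1, 1) = 0$, and

$$L(1, 0) = k_1 > 0, \quad L(0, 1) = k_0 > 0.$$

Let $f_i(x) = dP_i/d\nu(x)$ for $i = 0, 1$, where $\nu = P_0 + P_1$. Let $\delta$ be a decision rule. Define $\phi(x) = \delta(x)(\{1\})$. (The function $\phi$ is called the test function corresponding to $\delta$.)

Let $C$ denote the class of all rules with the test functions of the following forms:

For each $k \in (0, \infty)$ and each function $\gamma : \mathcal{X} \to [0, 1]$,

$$\phi_{k,\gamma}(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x), \\ \gamma(x) & \text{if } f_1(x) = k f_0(x), \\ 0 & \text{if } f_1(x) < k f_0(x). \end{cases} \tag{3.88}$$

For $k = 0$,

$$\phi_0(x) = \begin{cases} 1 & \text{if } f_1(x) > 0, \\ 0 & \text{if } f_1(x) = 0. \end{cases}$$

For $k = \infty$,

$$\phi_\infty(x) = \begin{cases} 1 & \text{if } f_0(x) = 0, \\ 0 & \text{if } f_0(x) > 0. \end{cases}$$

Then $C$ is a minimal complete class.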

Before giving the proof of this theorem, we will give an outline of the
proof because there are so many steps. We need to prove that if $\delta$ is a rule
not in $\mathcal{C}$, then there is a rule in $\mathcal{C}$ that dominates $\delta$. For a rule $\delta$ not in $\mathcal{C}$,
we find a rule $\delta^* \in \mathcal{C}$ that has the same value of the risk function at $\theta = 0$.
Half of the proof is devoted to this step. We show that the risk functions
of rules in $\mathcal{C}$ at $\theta = 0$ are decreasing in $k$, but they may not be continuous.
However, by defining $\gamma(x)$ appropriately, we can find a rule $\delta^* \in \mathcal{C}$ such
that $R(0, \delta^*) = R(0, \delta)$. We then show that $R(1, \delta^*) < R(1, \delta)$.

Proof of Theorem 3.87. Let $\mathcal{C}'$ be $\mathcal{C}$ together with all rules whose test
functions are of the form $\phi_{0,\gamma}$ in (3.88). Let $\delta \in \mathcal{C}' \setminus \mathcal{C}$. Then the test function
for $\delta$ is $\phi_{0,\gamma}$ for some $\gamma$ such that $P_0(\gamma(X) > 0) > 0$. Let $\delta_0$ be the rule whose
test function is $\phi_0$. Since $f_1(x) = 0$ for all $x \in A = \{x : \phi_{0,\gamma}(x) \ne \phi_0(x)\}$,
it follows that $R(1, \delta) = R(1, \delta_0)$. But

$$R(0, \delta) = k_0\left[E_0(\gamma(X) I_A(X)) + P_0(f_1(X) > 0)\right]$$
$$= k_0 E_0(\gamma(X) I_A(X)) + R(0, \delta_0) > R(0, \delta_0).$$

Hence $\delta$ is inadmissible and is dominated by $\delta_0$. We will now proceed to
prove that $\mathcal{C}'$ is a complete class. It will then follow from what we just
proved that $\mathcal{C}$ is a complete class.

Next, let $\phi$ be a test function corresponding to a rule $\delta$ not in $\mathcal{C}'$. Let

$$\alpha = R(0, \delta) = \int k_0 \phi(x) f_0(x)\,d\nu(x).$$



176 Chapter 3. Decision Theory 



Note that $\alpha \le k_0$. We will now try to find a rule $\delta^* \in \mathcal{C}$ such that $R(0, \delta^*) = \alpha$
and $R(1, \delta^*) < R(1, \delta)$. To that end, we define the function

$$g(k) = \int_{\{f_1(x) \ge k f_0(x)\}} k_0 f_0(x)\,d\nu(x).$$

Note that if $\gamma(x) = 1$ for all $x$ and $\delta^*$ has test function $\phi_{k,\gamma}$, then $g(k) =
R(0, \delta^*)$. Since $\{x : f_1(x) \ge k f_0(x)\}$ becomes smaller as $k$ increases, and
$f_1(x) < \infty$ a.e. $[\nu]$, it is easy to see that $g(k)$ decreases to 0 as $k \to \infty$. Also,
it is easy to see that $g(0) = k_0 \ge \alpha$. We now prove that $g$ is continuous
from the left and we find the limit from the right. First, note that

$$\bigcap_{k<m} \{x : f_1(x) \ge k f_0(x)\} = \{x : f_1(x) \ge m f_0(x)\}, \qquad (3.89)$$

$$\bigcup_{k>m} \{x : f_1(x) \ge k f_0(x)\} = \{x : f_1(x) > m f_0(x)\} \cup \{x : f_0(x) = 0\}.$$

Because $g$ is bounded, the monotone convergence theorem A.52 gives that

$$\lim_{k \uparrow m} g(k) = g(m),$$

$$\lim_{k \downarrow m} g(k) = \int_{\{x : f_1(x) > m f_0(x)\}} k_0 f_0(x)\,d\nu(x), \qquad (3.90)$$

hence $g$ is continuous from the left. Note that if $\gamma(x) = 0$ for all $x$ and $\delta^*$
has test function $\phi_{m,\gamma}$, then $R(0, \delta^*) = \lim_{k \downarrow m} g(k)$. Since $g$ is continuous
from the left, either there is a largest $k$ such that $g(k) > \alpha$ or there is
a smallest $k$ such that $g(k) = \alpha$. The first case occurs if $g$ has a jump
discontinuity and it jumps from a value greater than $\alpha$ down to a value
at most $\alpha$. The second case occurs if $g$ drops continuously to $\alpha$. In any
case, let the guaranteed value of $k$ be denoted $k^*$. If $\alpha = 0$, it may be that
$k^* = \infty$. If $\alpha > 0$, we must have $k^* < \infty$ because $g$ decreases to 0.

We will construct a decision rule called $\delta^*$ whose test function has the
form of $\phi_{k^*,\gamma}$. We consider the three possible cases:

1. $\alpha = 0$ and $k^* < \infty$,

2. $\alpha = 0$ and $k^* = \infty$,

3. $\alpha > 0$ and $k^* < \infty$.

We begin by proving that by appropriate choice of the function $\gamma$, we can
make $R(0, \delta^*) = R(0, \delta) = \alpha$.

In the first case, we can use (3.89) and (3.90) with $m = k^*$ to show that
$\gamma(x) = 0$ makes $R(0, \delta^*) = 0 = \alpha$. In the second case,

$$R(0, \delta^*) = \int k_0 \phi_\infty(x) f_0(x)\,d\nu(x) = 0 = \alpha.$$




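In the third case, if $g(k^*) = \alpha$, set $\gamma(x) = 1$ to make $R(0, \delta^*) = g(k^*) = \alpha$.
Otherwise, $g(k^*) > \alpha$. Set the right-hand side of (3.90) with $m = k^*$ equal
to $v \le \alpha$. Because $g$ has a jump discontinuity at $k^*$, it must be that

$$k_0 P_0(f_1(X) = k^* f_0(X)) = g(k^*) - v > \alpha - v \ge 0.$$

For those $x$ such that $f_1(x) = k^* f_0(x)$, define

$$0 \le \gamma(x) = \frac{\alpha - v}{g(k^*) - v} < 1.$$

It follows that

$$R(0, \delta^*) = v + \int_{\{f_1(x) = k^* f_0(x)\}} k_0 \frac{\alpha - v}{g(k^*) - v} f_0(x)\,d\nu(x) = \alpha.$$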
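
For a finite sample space, the choice of $k^*$ and $\gamma$ just described can be carried out by enumeration. The sketch below is illustrative only: it takes $k_0 = 1$, assumes $f_0(x) > 0$ for every $x$ and $0 < \alpha \le 1$, and works with probability mass functions with respect to counting measure rather than $\nu = P_0 + P_1$ (the ratio $f_1/f_0$ is the same under either dominating measure). The function names `binom_pmf` and `np_test` and the binomial parameters are hypothetical.

```python
from fractions import Fraction
from math import comb


def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def np_test(f0, f1, alpha):
    """Return (k_star, gamma, phi) with sum_x phi[x] * f0[x] == alpha (taking k_0 = 1).

    Assumes f0[x] > 0 for every x and 0 < alpha <= 1.
    """
    ratios = [b / a for a, b in zip(f0, f1)]

    def g(k):
        # g(k) = P_0(f_1(X) >= k f_0(X)); a left-continuous step function of k
        return sum(a for a, r in zip(f0, ratios) if r >= k)

    # Largest likelihood-ratio value at which g has not yet dropped below alpha.
    k_star = max(r for r in set(ratios) if g(r) >= alpha)
    # Limit of g from the right at k_star, called v in the text.
    v = sum(a for a, r in zip(f0, ratios) if r > k_star)
    # gamma equals 1 when g(k_star) == alpha, and lies strictly between 0 and 1 otherwise.
    gamma = (alpha - v) / (g(k_star) - v)
    phi = [1 if r > k_star else gamma if r == k_star else 0 for r in ratios]
    return k_star, gamma, phi


f0 = binom_pmf(10, Fraction(1, 2))
f1 = binom_pmf(10, Fraction(9, 10))
k_star, gamma, phi = np_test(f0, f1, Fraction(1, 20))
print(float(k_star), gamma)
print(sum(p * a for p, a in zip(phi, f0)))  # risk at theta = 0 when k_0 = 1
```

Exact rational arithmetic is used so that the equality $R(0, \delta^*) = \alpha$ can be checked without rounding error.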

If $k^* < \infty$, define

$$h(x) = [\phi_{k^*,\gamma}(x) - \phi(x)][f_1(x) - k^* f_0(x)].$$

We know that $\phi_{k^*,\gamma}(x) = 1 \ge \phi(x)$ for all $x$ such that $f_1(x) - k^* f_0(x) > 0$
and $\phi_{k^*,\gamma}(x) = 0 \le \phi(x)$ for all $x$ such that $f_1(x) - k^* f_0(x) < 0$. Since
$\phi$ is not of the form of some $\phi_{k,\gamma}$, then there must be a set $B$ such that
$\nu(B) > 0$ and $h(x) > 0$ for all $x \in B$. Since $\nu = P_0 + P_1$, we get that
$f_0(x) + f_1(x) = 1$ a.e. $[\nu]$. So,

$$0 < \int_B h(x)\,d\nu(x) \le \int h(x)\,d\nu(x)$$
$$= \int [\phi_{k^*,\gamma}(x) - \phi(x)] f_1(x)\,d\nu(x) - k^* \int [\phi_{k^*,\gamma}(x) - \phi(x)] f_0(x)\,d\nu(x)$$
$$= \int [\phi_{k^*,\gamma}(x) - \phi(x)] f_1(x)\,d\nu(x) + \frac{k^*}{k_0}(\alpha - \alpha)$$
$$= \frac{1}{k_1}[R(1, \delta) - R(1, \delta^*)].$$

Hence $R(1, \delta) > R(1, \delta^*)$.

If $k^* = \infty$, then $R(0, \delta) = 0$, and hence $\phi(x) = 0$ for almost all $x$ such
that $f_0(x) > 0$. So

$$R(1, \delta) = k_1 P_1(f_0(X) > 0) + k_1 \int_{\{x : f_0(x) = 0\}} [1 - \phi(x)] f_1(x)\,d\nu(x)$$
$$> k_1 P_1(f_0(X) > 0) = R(1, \delta^*),$$

where the inequality follows from the fact that

$$\int_{\{x : f_0(x) = 0\}} [1 - \phi(x)] f_1(x)\,d\nu(x) = 0$$

implies that $\phi(x) = \phi_\infty(x)$, a.e. $[\nu]$. Since $\delta$ was assumed not to be in $\mathcal{C}$,
this cannot happen.

What we have shown is that for every $\delta \notin \mathcal{C}'$, there is $\delta^* \in \mathcal{C}$ such that
$\delta^*$ dominates $\delta$. Hence $\mathcal{C}'$ is complete. As claimed earlier, it now follows
that $\mathcal{C}$ is complete.

It is easy to check (see Problem 29 on page 212) that no element of
$\mathcal{C}$ dominates any other element of $\mathcal{C}$, so nothing can be removed without
destroying the completeness of $\mathcal{C}$. Hence, $\mathcal{C}$ is minimal complete. □

Notice that $\mathcal{C}$ consists of all Bayes rules with respect to those priors with
positive mass on each $\theta$ (see Proposition 3.91), plus only one Bayes rule with
respect to each of the priors that put mass 0 on one of the $\theta$ values.

Proposition 3.91. In the decision problem described in Theorem 3.87,
each rule $\phi_{k,\gamma}$ is a Bayes rule with respect to a prior that assigns positive
probability to each parameter value. The only admissible Bayes rule with
respect to the prior that says $\Pr(\Theta = 0) = 1$ is $\phi_\infty$, and the only admissible
Bayes rule with respect to the prior that says $\Pr(\Theta = 1) = 1$ is $\phi_0$.

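Example 3.92. Let $\theta_1 > \theta_0$, and let $f_0$ and $f_1$ be the $N(\theta_0, 1)$ and $N(\theta_1, 1)$
densities, respectively. Then, for any $k$, $f_1(x)/f_0(x) > k$ if and only if

$$x > \frac{\theta_0 + \theta_1}{2} + \frac{\log k}{\theta_1 - \theta_0}.$$

There is no need to introduce $\gamma(x)$, since equality has zero probability.

Example 3.93. Let

$$f_0(x) = \binom{n}{x} p_0^x (1 - p_0)^{n-x}, \qquad f_1(x) = \binom{n}{x} p_1^x (1 - p_1)^{n-x},$$

for some $1 > p_1 > p_0 > 0$. Then, for any $k$, $f_1(x)/f_0(x) > k$ if and only if

$$x > \frac{n \log\left(\frac{1 - p_0}{1 - p_1}\right) + \log k}{\log\left(\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}\right)}.$$

For example, if $p_1 = 0.9$, $p_0 = 0.5$, and $n = 10$, we get

$$x > \frac{16.09 + \log(k)}{2.197}.$$

If $k = 4.408$, for example, $x = 8$ is the cutoff, and $\gamma(8)$ must be chosen.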
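
The arithmetic in Example 3.93 can be checked numerically. The short sketch below evaluates the cutoff formula for the binomial likelihood ratio; the function name `binomial_cutoff` is a hypothetical helper, and natural logarithms are assumed, which is consistent with the constants 16.09 and 2.197 quoted in the example.

```python
import math


def binomial_cutoff(n, p0, p1, k):
    """Value c such that f1(x)/f0(x) > k exactly when x > c (requires 0 < p0 < p1 < 1)."""
    numerator = n * math.log((1 - p0) / (1 - p1)) + math.log(k)
    denominator = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return numerator / denominator


# The numbers used in Example 3.93.
print(10 * math.log(0.5 / 0.1))                 # numerator constant, about 16.09
print(math.log(0.9 * 0.5 / (0.5 * 0.1)))        # denominator, about 2.197
print(binomial_cutoff(10, 0.5, 0.9, 4.408))     # very close to 8
```

Because the cutoff lands on the integer 8, the event $f_1(X) = k f_0(X)$ has positive probability, which is why $\gamma(8)$ has to be specified.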






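Example 3.94. Let

$$f_0(x) = \binom{12}{x} \left(\frac{1}{2}\right)^{12} \text{ for } x = 0, 1, \ldots, 12, \qquad f_1(x) = \frac{1}{12} I_{[0,12]}(x).$$

These are Bin(12, 1/2) and U(0, 12) distributions. The measure $\mu$ is Lebesgue
measure plus counting measure on the integers. To make both distributions
absolutely continuous, we must change $f_1(x)$ to

$$f_1(x) = \frac{1}{12} I_{[0,12] \setminus \{0, 1, \ldots, 12\}}(x).$$

Then,

$$\frac{f_1(x)}{f_0(x)} = \begin{cases} \infty & \text{if } 0 < x < 12,\ x \text{ not an integer}, \\ 0 & \text{if } x = 0, 1, \ldots, 12, \\ \text{undefined} & \text{otherwise.} \end{cases}$$

There is only one admissible rule, namely, do what is obvious. If the observed
$x$ is an integer, choose the binomial distribution; otherwise, choose the uniform
distribution.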

There is an analog to the minimax theorem for complete classes. The 
proof is similar also and is adapted from Ferguson (1967). This theorem can 
also be thought of as a generalization of the Neyman-Pearson fundamental 
lemma to larger, but still finite, parameter spaces and more general action 
spaces. However, explicit forms for the decision rules are not given because 
the action space is unspecified. 

Theorem 3.95 (Complete class theorem). Suppose that $\Omega$ has only $k$
points, the loss function is bounded below, and the risk set is closed from
below. Then the set of all Bayes rules is a complete class, and the set of
admissible Bayes rules is a minimal complete class. These are also the rules
whose risk functions are on the lower boundary of the risk set.

First, we need a lemma. 

Lemma 3.96.^16 Suppose that $\Omega$ has only $k$ points and the loss function is
bounded below. Let the risk set be $R$. Then

• every admissible rule is a Bayes rule, and

• for every point $z \in \partial_L$, there exists a prior $\lambda^\top = (\lambda_1, \ldots, \lambda_k)$ such
that $\lambda^\top z = \inf_{y \in R} \lambda^\top y$.



16 This lemma is used in the proofs of Theorem 3.95 and Lemma 4.43. It is also
used in the discussion of testing simple hypotheses against simple alternatives in 
Chapter 4. 









Proof. If a rule is admissible, it is clear that its risk function is a point in
$\partial_L$. So, let $z \in \partial_L$, and define

$$A = \{x : x_i < z_i, \text{ for all } i\}.$$

Then $A$ and $R$ are disjoint convex sets, so the separating hyperplane
theorem C.5 says that there exists a constant $c$ and a vector $v$ such that for
all $x \in A$ and all $y \in R$, $v^\top y \ge c \ge v^\top x$. If a coordinate $v_i < 0$, we can
find a point $x$ such that $x_j = z_j - \epsilon$ for $j \ne i$ and $x_i$ is sufficiently negative
so that $v^\top x > c$, a contradiction. So we know that all coordinates of $v$ are
nonnegative. Set

$$\lambda_i = \frac{v_i}{\sum_{j=1}^k v_j}, \quad i = 1, \ldots, k.$$

Since $z$ is a limit point of $A$, there exists a point $x \in A$ such that $v^\top x$ is
arbitrarily close to $v^\top z$. Hence $c = v^\top z$, and $\lambda^\top y \ge \lambda^\top z$ for all $y \in R$. So,
if $z \in R$, then $z$ is the risk function of a Bayes rule with respect to $\lambda$. If
$z \notin R$, then it is still true that $\lambda^\top z = \inf_{y \in R} \lambda^\top y$. □
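
The supporting-hyperplane statement in Lemma 3.96 can be illustrated when the risk set is the convex hull of finitely many risk points. A linear function $\lambda^\top y$ attains its minimum over such a hull at one of the generating points, so a search over those points suffices. The sketch below is only an illustration under that assumption; the risk points and the helper `bayes_points` are hypothetical.

```python
from fractions import Fraction as F


def bayes_points(points, lam):
    """Points of a finite set that minimize the prior-weighted risk lam^T y."""
    def value(y):
        return sum(l * yi for l, yi in zip(lam, y))

    best = min(value(y) for y in points)
    return [y for y in points if value(y) == best]


# Hypothetical risk points (R(theta_1, delta), R(theta_2, delta)) of four rules.
points = [(F(1), F(4)), (F(2), F(2)), (F(3), F(3)), (F(4), F(1))]

print(bayes_points(points, (F(1, 2), F(1, 2))))    # the uniform prior selects (2, 2)
print(bayes_points(points, (F(9, 10), F(1, 10))))  # a prior favoring theta_1 selects (1, 4)
```

The point (3, 3) is never selected, whatever the prior, because (2, 2) is smaller in both coordinates. This matches the claim that only points on the lower boundary can be Bayes.
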

Proof of Theorem 3.95. From the definition of the lower boundary of
the risk set, it is clear that every point on $\partial_L$ corresponds to an admissible
rule. It is also clear that to every point in $R$ not on $\partial_L$ there corresponds a
point in $R$ that dominates it. Hence the lower boundary contains the risk
functions of all and only admissible rules.

Next, we show that the rules whose risk functions are on $\partial_L$ form a
minimal complete class. For each $z \in R$, define

$$A_z = \{x : x_i \le z_i, \text{ for all } i\}.$$

Let $z$ not be on $\partial_L$. Then there exists $z' \in R$ such that $z'$ dominates $z$
and $A_{z'} \subset A_z$. If $z' \in \partial_L$, we are done; if not, apply Lemma 3.79 with
$C = A_{z'} \cap R$ to conclude that there exists a point in $\partial_L$ that is at least
as good as $z'$ and hence dominates $z$. This makes the admissible rules a
complete class. Since no admissible rule can be dominated, it is also a
minimal complete class.

Lemma 3.96 shows that $\partial_L$ consists of the risk functions of all admissible
Bayes rules, so these rules form a minimal complete class. Since the set
of Bayes rules contains the set of admissible Bayes rules, the set of Bayes
rules is a complete class also. □

Notice that the complete class theorem 3.95 says that the rules $\phi_{k,\gamma}$,
$\phi_0$, and $\phi_\infty$ in the Neyman-Pearson fundamental lemma 3.87 are the rules
whose risk functions are on $\partial_L$ in the risk set.




3.3 Axiomatic Derivation of Decision Theory* 

In Section 3.1, we claimed that a Bayesian would choose that decision rule 
which minimized the expected loss with respect to the posterior distribution
of $\Theta$. (Recall the expected loss principle.) This may seem reasonable
or ad hoc on the surface, but there is some justification for such a princi- 
ple. Von Neumann and Morgenstern (1947) and Anscombe and Aumann 
(1963) set up a system of axioms for preferences among decisions, which 
lead to the conclusion that one should choose the decision that minimizes 
the expected loss. In this section, we present those axioms together with 
a proof of the important conclusion. For alternative derivations, see DeG- 
root (1970), Savage (1954), Fishburn (1970), or Ferguson (1967). In this 
section, we prove Theorem 3.108, which says that so long as preferences 
satisfy some axioms, there is a probability distribution and a utility func- 
tion (like the negative of a loss function) such that, in any comparison of 
decisions, the one with higher expected utility is the more preferred de- 
cision. We also prove Theorem 3.110, which says that if data are to be 
observed before making decisions, then the comparison should be based on 
conditional expected utility given the data. 

We begin with background information in the form of definitions and 
axioms. Then we present some examples of the axioms, the statements of 
the main results, and the proofs. 

3.3.1 Definitions and Axioms 

The setup we will consider is one in which there is uncertainty about which
of several possibilities will occur. Each possibility will be called a state (of
Nature). Let $R$ be the set of states of Nature. Let $\mathcal{A}_1$ be a $\sigma$-field of subsets
of $R$. Assume that $(R, \mathcal{A}_1)$ is a Borel space. We will also assume that the
final outcomes of our choices will be that we will receive some consequence
or prize. In general, let the set of prizes be an arbitrary set $P$ with $\sigma$-field
of subsets $\mathcal{A}_2$ which contains all singletons. We assume that all prizes are
available in all states.

In addition to states of Nature, we will assume that there is also un- 
certainty about the outcome of another experiment about which we are 
willing to specify probabilities. This experiment is assumed to be capable 
of producing events with arbitrary probabilities, and we are completely 
indifferent between the possible outcomes prior to choosing between the 
options available to us. We also assume that this experiment is indepen- 
dent of the state of Nature. One can think of this experiment as a spinner 
with arbitrarily fine precision which we believe to be fair. The purpose of 
this experiment is to allow us to refer to probability distributions over the 



*This section may be skipped without interrupting the flow of ideas. 






set of prizes. 

Definition 3.97. A Von Neumann-Morgenstern lottery (NM-lottery) is
a probability on the space $(P, \mathcal{A}_2)$ that is concentrated on finitely many
prizes.^17 Call the set of all NM-lotteries $\mathcal{L}$.

For convenience, if $L \in \mathcal{L}$ gives probability 1 to a prize $p$, then we will often
denote $L$ by $p$. If the NM-lottery $L$ awards prize $p_i$ with probability $\alpha_i$ for
$i = 1, \ldots, k$, we will denote it by $L = \alpha_1 p_1 + \cdots + \alpha_k p_k$, where $\alpha_i \ge 0$ for all
$i$, $\sum_{i=1}^k \alpha_i = 1$. This NM-lottery is to be interpreted as meaning that the
results of the experiment $T$ are to be partitioned into $k$ events $E_1, \ldots, E_k$
with probabilities $\alpha_1, \ldots, \alpha_k$, respectively, and prize $p_i$ is awarded if event
$E_i$ occurs. We assume that the details of the partitioning of the results of
$T$ are irrelevant to us. We only care about the probabilities of the various
prizes.

The choices we need to make will be among NM-lotteries and some more
general gambles. In order to make choices among gambles, we need to be
able to say which ones we like better than others.^18 We will use the symbol
$\preceq$ to indicate weak preference, that is, $L \preceq L'$ means that we like $L'$ at
least as much as $L$. Assuming that a weak preference $\preceq$ is defined on $\mathcal{L}$, we
can now define the more general class of gambles to which we will extend
$\preceq$. These gambles will be defined as functions from $R$ to $\mathcal{L}$.

Definition 3.98. A function $H : R \to \mathcal{L}$ is called a horse lottery if, for
each NM-lottery $L$, $\{r : H(r) \preceq L\} \in \mathcal{A}_1$. Let $\mathcal{H}$ denote the set of all horse
lotteries.

If $H(r) = L$ for all $r$, we will often denote $H$ by $L$. A horse lottery is a
gamble that pays off the result of a state-specific NM-lottery depending on
which state occurs.



17 The reason that we choose this restricted set of probability distributions
rather than the set of all probability distributions on $(P, \mathcal{A}_2)$ is threefold. First,
since our axioms will require that various relations hold for all NM-lotteries, the
smaller the set of NM-lotteries is, the less restrictive are our axioms. Second,
there is one useful result (Corollary 3.143) that fails without further assumptions
if we allow all probabilities to be NM-lotteries. Third, since we will only consider
measures on $(P, \mathcal{A}_2)$ which concentrate on finitely many points, the $\sigma$-field $\mathcal{A}_2$
can be quite arbitrary.

18 There is one assumption implicit in all of this discussion which is not of 
a mathematical nature. The assumption is that our choices do not affect our 
opinions. As an example that violates this assumption, suppose that I am trying 
to decide whether or not to offer accidental death insurance to an individual. If I 
sell the insurance, the individual may be more inclined to act in a risky manner 
(such as mountain climbing or bungee jumping). If the individual does not have 
insurance, he or she may be less inclined toward the risky behavior. The theory 
described in this section assumes that these considerations are absent from our 
choices. 






It is useful to have notation for strict preference and indifference also.
If $H_1 \preceq H_2$ but it is not the case that $H_2 \preceq H_1$, then we say $H_1 \prec H_2$
and that $H_2$ is strictly preferred to $H_1$. If both $H_1 \preceq H_2$ and $H_2 \preceq H_1$, we
say $H_1 \sim H_2$ and we are indifferent between $H_1$ and $H_2$, or $H_1$ and $H_2$ are
indifferent.

The first axiom says that the relation $\preceq$ is a weak order, which we define
now.

Definition 3.99. A binary relation $\preceq$ on a set $A$ is a weak order if it
satisfies the following conditions:

• For every $a \in A$, $a \preceq a$.

• For every $a, b \in A$, either $a \preceq b$ or $b \preceq a$, or both.

• If $a \preceq b$ and $b \preceq c$, then $a \preceq c$.

We say that $\preceq$ is degenerate if $a \preceq b$ for all $a, b \in A$. If $\preceq$ is not degenerate,
we say it is nondegenerate.

Note that a preference is degenerate if and only if all horse lotteries are
indifferent.

Axiom 1 (Weak order). The relation of weak preference $\preceq$ is a weak
order on the set $\mathcal{H}$ of horse lotteries.

The first axiom does not put very many constraints on the possible pref- 
erences we could express. There is the constraint that preferences be tran- 
sitive, but this is not a very controversial requirement. There is also the 
constraint that every pair of horse lotteries be affirmatively compared. That 
is, for every $H_1$ and $H_2$, either $H_1 \preceq H_2$ or $H_2 \preceq H_1$ (or both). It is not
allowed that I refuse (or am unable) to compare $H_1$ with $H_2$. Seidenfeld,
Schervish, and Kadane (1995) study, in depth, the problems that arise if 
one relaxes Axiom 1 to allow that certain horse lotteries are not compared. 

Since the set $\mathcal{L}$ is convex, the set of all functions from $R$ to $\mathcal{L}$ is convex
with $H = \alpha H_1 + (1 - \alpha)H_2$ defined by $H(r) = \alpha H_1(r) + (1 - \alpha)H_2(r)$ for all
$r$. Unfortunately, the requirement that $\{r : H(r) \preceq L\} \in \mathcal{A}_1$ for all $L \in \mathcal{L}$
prevents us from concluding immediately that $\alpha H_1 + (1 - \alpha)H_2 \in \mathcal{H}$, even
if both $H_1, H_2 \in \mathcal{H}$. However, all NM-lotteries satisfy this condition. That
is, if $H_1(r) = L_1$ for all $r$ and $H_2(r) = L_2$ for all $r$, then $\alpha H_1 + (1 - \alpha)H_2 =
\alpha L_1 + (1 - \alpha)L_2 \in \mathcal{H}$. Eventually, we will be able to prove that $\mathcal{H}$ is convex.
(See Lemma 3.131.) Until we do so, however, many results will have to be
stated in such a way that they still apply without $\mathcal{H}$ being convex. This
will be done by adding a condition that the result hold not necessarily for
all horse lotteries, but only for every convex set of horse lotteries. The set
of constant horse lotteries (identified with $\mathcal{L}$ as above) is convex, so the
condition is not vacuous.

Another requirement of the same sort as transitivity is one that says that 
if I prefer one prize to another, then I should prefer a gamble that gives 






me that prize with some probability to a gamble that gives me the other 
prize with the same probability, all other things being equal. We make this 
precise with another axiom. 

Axiom 2 (Sure-thing principle). For each convex set $\mathcal{H}_0$ of horse
lotteries; for every three horse lotteries $H_1, H_2, H \in \mathcal{H}_0$; and for every
$0 < \alpha < 1$, $H_1 \preceq H_2$ if and only if $\alpha H_1 + (1 - \alpha)H \preceq \alpha H_2 + (1 - \alpha)H$.

Two other axioms are often assumed for mathematical reasons. The first 
assures that no horse lottery is worth infinitely more than another. A def- 
inition is required first. 

Definition 3.100. Let $L$ be an NM-lottery and let $\{L_k\}_{k=1}^\infty$ be a sequence
of NM-lotteries. We say that $L_k$ converges pointwise to $L$ (denoted $L_k \to L$)
if and only if, for every $A \in \mathcal{A}_2$, $\lim_{k \to \infty} L_k(A) = L(A)$.

Since each NM-lottery is a probability distribution over $(P, \mathcal{A}_2)$, it is a
function from $\mathcal{A}_2$ to $[0, 1]$. The above definition of pointwise convergence
agrees with the usual concept of pointwise convergence of functions.

Axiom 3 (Continuity). Let $H$ be a horse lottery, and let $\{H_k\}_{k=1}^\infty$ be a
sequence of horse lotteries such that $H_k(r)$ converges pointwise to $H(r)$ for
all $r$. Let $H'$ be another horse lottery. If $H_k \preceq H'$ for all $k$, then $H \preceq H'$.
If $H' \preceq H_k$ for all $k$, then $H' \preceq H$.

The next axiom assures that the relative values of prizes do not vary from 
state to state. It is not obvious that such an axiom should be adopted. In 
fact, this axiom appears to be nothing more than a mathematical tool for 
ensuring that probability and utility can be separated. In Section 3.3.6, we 
will consider what happens if we do not assume Axiom 4. A definition is 
required before we can state Axiom 4. 

Definition 3.101. If $E \in \mathcal{A}_1$ and $H_1, H_2 \in \mathcal{H}$, the preference between $H_1$
and $H_2$ is said to be called-off when event $E$ occurs if $H_1(r) = H_2(r)$ for all
$r \in E$. A set $B \in \mathcal{A}_1$ is called null if whenever the preference between $H_1$
and $H_2$ is called-off when $B^C$ occurs, we have $H_1 \sim H_2$. A subset is called
nonnull if it is not null. Similarly, a state $s$ is nonnull if the singleton set
$\{s\}$ is nonnull.

Axiom 4 (State independence). Suppose that there is a nonnull set
$B$ such that the preference between $H_1$ and $H_2$ is called-off if $B^C$ occurs.
Suppose also that $H_1(r) = L_1$ and $H_2(r) = L_2$ for $r \in B$. Then $H_1 \preceq H_2$
if and only if $L_1 \preceq L_2$.

An interesting discussion of state independence is given by Schervish, 
Seidenfeld, and Kadane (1990). 

Next, we introduce an axiom that is only needed when there are infinitely
many states of Nature. When there are infinitely many possible prizes and
states, Seidenfeld and Schervish (1983) show that it is possible for $H_1$ to
be preferred to $H_2$ merely because $H_1$ offers more possible prizes than $H_2$,
even though $H_2$ offers more valuable prizes than $H_1$. To avoid this problem,
we introduce an axiom.

Axiom 5 (Dominance). If $H_1(r) \preceq H_2(r)$ for all $r$, then $H_1 \preceq H_2$.

The dominance axiom ties the values of horse lotteries to the values of 
the NM-lotteries which they assume. It can be shown (see Problem 38 on 
page 213) that Axioms 1-4 imply dominance when R is finite. An example 
in which Axioms 1-4 hold but Axiom 5 does not is in Example 3.107. 

Next, we define conditional preference and introduce an axiom to link 
preferences with conditional preferences. 

Definition 3.102. Let $(\mathcal{X}, \mathcal{B})$ be a measurable space, and let $X : R \to \mathcal{X}$
be a random quantity. We define a conditional preference relation given $X$
to be a set of binary relations on $\mathcal{H}$, $\{\preceq_x : x \in \mathcal{X}\}$, where $H_1 \preceq_x H_2$ is read
as "we would like $H_2$ at least as much as $H_1$ if we were to learn that $X = x$
and nothing else of relevance," and

• for every $x \in \mathcal{X}$, $H \in \mathcal{H}$, $L \in \mathcal{L}$, $\{r : H(r) \preceq_x L\} \in \mathcal{A}_1$;

• $H_1 \preceq_x H_2$ if and only if, for every pair $(H_1', H_2')$ of horse lotteries
satisfying $H_i'(r) = H_i(r)$ for $i = 1, 2$ and all $r \in X^{-1}(\{x\})$, $H_1' \preceq_x H_2'$;
and

• for all $H_1, H_2 \in \mathcal{H}$, $\{x : H_1 \preceq_x H_2\} \in \mathcal{B}$.

The first condition in the list is to ensure that the same horse lotteries 
are compared conditionally as unconditionally. The second condition
guarantees that conditional on $X = x$, it does not matter what would have
happened if $X = y \ne x$ had occurred. This makes conditional preference
truly conditional. The measurability condition is added for mathematical 
convenience. None of the axioms stated for unconditional preference says 
anything about conditional preference. We now suppose that the following 
axiom holds. 

Axiom 6 (Conditional preference). If $(\mathcal{X}, \mathcal{B})$ is a measurable space,
$X : R \to \mathcal{X}$ is a random quantity, and $\{\preceq_x : x \in \mathcal{X}\}$ is a conditional
preference relation given $X$, then

• for all $x$, $\preceq_x$ satisfies Axioms 1-5;

• $H_1 \preceq_x H_2$ for all $x$ implies $H_1 \preceq H_2$;

• if $H_1 \preceq_x H_2$ for all $x$ and $H_1 \prec_x H_2$ for all $x \in B$, where $X^{-1}(B)$ is
nonnull, then $H_1 \prec H_2$.

This axiom says that if we know that we will prefer $H_2$ to $H_1$ after we
observe $X$ no matter what value we observe for $X$, then we should prefer $H_2$ to
$H_1$ now. In the case in which there are only finitely many states of Nature,
there is a means of deriving conditional preferences from unconditional
preferences, but one still needs an axiom to say that the derived
preferences should be used conditionally. The method is to make use of called-off
preferences. If $X : R \to \mathcal{X}$ is a random quantity and $X^{-1}(\{x\}) = E$, then
one can require that conditional preferences given $E$ agree with preferences
that are called-off when $E^C$ occurs.

If we must condition on more than one quantity, there is an issue of 
consistency. For example, if Y is a function of X and we first condition on 
Y and then on X, do we get the same conditional preferences as if we had 
conditioned on X alone? 

Definition 3.103. Let $X : R \to \mathcal{X}$ and $Y : R \to \mathcal{Y}$ be random quantities
such that $Y$ is a function of $X$, $Y = h(X)$. We say that conditional
preference relations $\{\preceq_y : y \in \mathcal{Y}\}$ and $\{\preceq_x : x \in \mathcal{X}\}$ given $Y$ and $X$ are consistent
if there exists a set $B \subseteq \mathcal{Y}$ such that $Y^{-1}(B)$ is null and for every $y \notin B$,
$\{\preceq_x : x \in h^{-1}(y)\}$ is a conditional preference relation relative to $\preceq_y$ which
satisfies Axiom 6.

3.3.2 Examples 

Here are a few examples to illustrate how the axioms can be satisfied or 
violated. 

Example 3.104. Suppose that there are only two states of Nature $r_1$ and $r_2$.
Suppose also that there are only two prizes, $p_1$ and $p_2$. Then horse lotteries are
characterized by the pair of numbers $(q_1, q_2)$, where $q_i$ is the probability of $p_2$ in
state $r_i$ for $i = 1, 2$. We give two examples of weak preference, one which satisfies
the axioms and one which does not.

First, suppose that we claim that $(q_1, q_2) \preceq (q_1', q_2')$ if and only if $q_1 + q_2 \le
q_1' + q_2'$. It is straightforward to check that this satisfies all of the axioms. The
representation of preference according to Theorem 3.108 below will be $\Pr(r_1) =
\Pr(r_2) = 1/2$ and $U(p_2) > U(p_1)$. Consider two horse lotteries $H_1 = (q_1, q_2)$ and
$H_2 = (q_1, q_3)$. The preference between $H_1$ and $H_2$ is called-off if $\{r_1\}$ occurs. It
is easy to see in this example, and it can be shown in general that if $r_2$ is nonnull
(see Lemma 3.148), then $H_1 \preceq H_2$ if and only if $H_3 \preceq H_4$ for all $H_3, H_4$ of the
form $H_3 = (q_4, q_2)$, $H_4 = (q_4, q_3)$. That is, a pair of horse lotteries that differ
only on a nonnull state are ranked the same as every other pair that differ the
same way on the same state.

Second, suppose that we claim that $(q_1, q_2) \preceq (q_1', q_2')$ if and only if $q_1 < q_1'$
or ($q_1 = q_1'$ and $q_2 \le q_2'$). This fails Axiom 3, and there is no expected utility
representation for the preferences.
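
One way to see the failure of Axiom 3 for the second (lexicographic) preference is to exhibit a sequence $H_k \to H$ with $H_k \preceq H'$ for all $k$ but not $H \preceq H'$. The sketch below uses a hypothetical choice of $H'$ and $H_k$, not one given in the text, and represents each horse lottery by its pair $(q_1, q_2)$.

```python
from fractions import Fraction


def lex_weakly_below(h, h_prime):
    """(q1, q2) is weakly below (q1', q2') in the lexicographic preference of the example."""
    (q1, q2), (r1, r2) = h, h_prime
    return q1 < r1 or (q1 == r1 and q2 <= r2)


# Hypothetical choice: H' = (1/2, 0) and H_k = (1/2 - 1/k, 1) for k = 2, 3, ...
h_prime = (Fraction(1, 2), Fraction(0))
sequence = [(Fraction(1, 2) - Fraction(1, k), Fraction(1)) for k in range(2, 1000)]
limit = (Fraction(1, 2), Fraction(1))  # pointwise limit of H_k as k grows

print(all(lex_weakly_below(h_k, h_prime) for h_k in sequence))  # each H_k is weakly below H'
print(lex_weakly_below(limit, h_prime))                         # the limit is not
```

Every $H_k$ is ranked below $H'$ because its first coordinate is strictly smaller, yet the limit ties on the first coordinate and wins on the second.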

Example 3.105. Let $P = \{p_1, \ldots, p_m\}$ and $R = \{r_1, \ldots, r_n\}$. Let $q_1, \ldots, q_n$
be nonnegative numbers that add to 1. Let $u_1, \ldots, u_m$ be real numbers. For
each NM-lottery $L = \alpha_1 p_1 + \cdots + \alpha_m p_m$, define $U(L) = \sum_{i=1}^m \alpha_i u_i$. For each
horse lottery $H = (L_1, \ldots, L_n)$, define $U(H) = \sum_{i=1}^n q_i U(L_i)$. Say that $H_1 \preceq
H_2$ if and only if $U(H_1) \le U(H_2)$. It is easy to see that this is a weak order.
Since $U(\alpha L_1 + (1 - \alpha)L_2) = \alpha U(L_1) + (1 - \alpha)U(L_2)$, it is easy to verify that
Axiom 2 holds. Continuity follows since $L_k \to L$ implies $U(L_k) \to U(L)$. State
independence follows easily from the definition of $U(H)$. Theorem 3.108 says that
all examples in the finite case will be like this.

Let $X : R \to \mathcal{X}$ be a random quantity. Clearly, $X$ can take on only finitely
many values. For each value $x$ of $X$, let $X^{-1}(\{x\}) = \{r_{k_1(x)}, \ldots, r_{k_{s_x}(x)}\}$. If
$v_x = \sum_{j=1}^{s_x} q_{k_j(x)} = 0$, then $X^{-1}(\{x\})$ is null. Otherwise, define $U_x(H) =
\sum_{j=1}^{s_x} q_{k_j(x)} U(L_{k_j(x)})$, and say that $H_1 \preceq_x H_2$ if and only if $U_x(H_1) \le U_x(H_2)$. It
is easy to verify that this is a conditional preference and that it satisfies Axiom 6.
Theorem 3.110 says that all conditional preferences must be of this form in the
finite case.
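
The finite construction in Example 3.105 is easy to compute directly. The following sketch uses hypothetical utilities, state probabilities, and horse lotteries (none of these numbers come from the text) and checks the linearity property that underlies Axiom 2, together with the conditional ranking given a set of states.

```python
from fractions import Fraction as F

u = [F(0), F(1)]                 # hypothetical utilities u_1, u_2 of prizes p_1, p_2
q = [F(1, 4), F(1, 4), F(1, 2)]  # hypothetical state probabilities q_1, q_2, q_3


def U_lottery(alpha):
    """U(L) = sum_i alpha_i u_i for an NM-lottery given by its probabilities alpha."""
    return sum(a * ui for a, ui in zip(alpha, u))


def U_horse(H):
    """U(H) = sum_i q_i U(L_i) for a horse lottery H = (L_1, ..., L_n)."""
    return sum(qi * U_lottery(L) for qi, L in zip(q, H))


def U_given(H, states):
    """U_x(H), summing only over the states (0-based indices) in X^{-1}({x})."""
    return sum(q[j] * U_lottery(H[j]) for j in states)


def mix(a, H1, H2):
    """State-by-state mixture a*H1 + (1 - a)*H2."""
    return [tuple(a * x + (1 - a) * y for x, y in zip(L1, L2)) for L1, L2 in zip(H1, H2)]


H1 = [(F(1), F(0)), (F(0), F(1)), (F(1, 2), F(1, 2))]
H2 = [(F(0), F(1)), (F(1), F(0)), (F(1, 4), F(3, 4))]

print(U_horse(H1), U_horse(H2))                 # H1 is ranked weakly below H2
print(U_horse(mix(F(1, 3), H1, H2)))            # equals (1/3) U(H1) + (2/3) U(H2)
print(U_given(H1, [0, 1]), U_given(H2, [0, 1])) # indifferent given {r_1, r_2}
print(U_given(H1, [2]), U_given(H2, [2]))       # H2 preferred given {r_3}
```

In this illustration the two horse lotteries are indifferent conditional on the first two states, so the unconditional preference for `H2` comes entirely from the third state.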

Example 3.106. Let $(R, \mathcal{A}_1)$ be an arbitrary Borel space with a probability
$Q$, and let $P$ be an arbitrary set. Let $U : P \to \mathbb{R}$ be a bounded function.
For each NM-lottery $L = \alpha_1 p_1 + \cdots + \alpha_k p_k$, define $U(L) = \sum_{i=1}^k \alpha_i U(p_i)$. Let
$\mathcal{H}$ be the set of all functions $H : R \to \mathcal{L}$ such that $U(H(r))$ is a measurable
function of $r$. For $H \in \mathcal{H}$, define $U(H) = \int U(H(r))\,dQ(r)$. Say that $H_1 \preceq
H_2$ if and only if $U(H_1) \le U(H_2)$. For each $H \in \mathcal{H}$ and each $L \in \mathcal{L}$, $\{r :
H(r) \preceq L\} = \{r : U(H(r)) \le U(L)\} \in \mathcal{A}_1$ because we assumed that $U(H(\cdot))$
is measurable. Axiom 1 clearly holds. To see that Axiom 2 holds, note that
$U(\alpha H_1 + [1 - \alpha]H_2) = \alpha U(H_1) + (1 - \alpha)U(H_2)$. If $H_n(r) \to H(r)$ for all $r$,
then $\lim_{n \to \infty} U(H_n(r)) = U(H(r))$ for all $r$. Since $U$ is bounded, the dominated
convergence theorem A.57 says that $\lim_{n \to \infty} U(H_n) = U(H)$. This implies that
Axiom 3 holds. Note that $B \in \mathcal{A}_1$ is null if and only if $Q(B) = 0$. To see that
Axiom 4 is satisfied, let the preference between $H_1$ and $H_2$ be called-off when
$B^C$ occurs, $B$ is nonnull, and $H_1(r) = L_1$ and $H_2(r) = L_2$ for all $r \in B$. Then
$U(H_1) - U(H_2) = Q(B)[U(L_1) - U(L_2)]$, so $H_1 \preceq H_2$ if and only if $L_1 \preceq L_2$.
Axiom 5 follows easily from part 3 of Proposition A.49. Theorem 3.108 says that
when $R$ and/or $P$ is infinite, this example describes all preference relations that
satisfy the axioms.

Let $X : R \to \mathcal{X}$ be a random quantity. Let $\{Q(\cdot|x) : x \in \mathcal{X}\}$ be a regular
conditional distribution given $X$. (Use Corollary B.55 to choose a version of
$Q(\cdot|x)$ that gives probability 1 to $X^{-1}(\{x\})$.) For each $x \in \mathcal{X}$ and $H \in \mathcal{H}$, define
$U_x(H) = \int U(H(r))\,dQ(r|x)$. If $H_1, H_2 \in \mathcal{H}$, Theorem B.46 can be used to show
that $U_x(H_i)$ is a measurable function of $x$ for each $i$, hence $\{x : H_1 \preceq_x H_2\} \in \mathcal{B}$,
and $\{\preceq_x : x \in \mathcal{X}\}$ is a conditional preference. Axiom 6 follows from the law of
total probability B.70. Theorem 3.110 says that except for differences on null
sets, this example describes all conditional preferences that satisfy Axiom 6.

Example 3.107. This example is based on one of Seidenfeld and Schervish
(1983), and it is designed to show why Axiom 5 is needed in the infinite case.
Let $R = [0, 1]$ and let $\mathcal{A}_1$ be the Borel $\sigma$-field. Let $Q$ be Lebesgue measure,
and let $P = [0, 1]$. Let $V : P \to \mathbb{R}$ be defined by $V(p) = p$. For each
NM-lottery $L = \alpha_1 p_1 + \cdots + \alpha_k p_k$, define $V(L) = \sum_{i=1}^k \alpha_i V(p_i)$. For each function
$H : R \to \mathcal{L}$, define $w_H(p, r) = H(r)(\{p\})$, that is, the probability that $H(r)$
assigns to the prize $p$. Let $\mathcal{H}$ be the set of all $H$ such that $w_H(p, r)$ is a
measurable function of $r$ for every $p$ and $V(H(r))$ is a measurable function of $r$. For
$H \in \mathcal{H}$, define $V(H) = \int V(H(r))\,dQ(r)$. Note that $V$ is the same as the $U$
in Example 3.106. Define $w_H(p) = \int w_H(p, r)\,dQ(r)$. Since $\sum_{\text{All } p} w_H(p, r) = 1$
for all $r$, there can be at most countably many $p$ such that $w_H(p) > 0$. Define
$W(H) = 1 - \sum_{\text{All } p} w_H(p)$. The value $W(H)$ measures the extent to which more
than countably many different prizes are assigned by $H$. For example, if the set
of all prizes assigned by $H$ is countable, then $W(H) = 0$. In particular, it is easy
to see that $W(L) = 0$ for all $L \in \mathcal{L}$. Define $U(H) = V(H) + W(H)$ and say
that $H_1 \preceq H_2$ if and only if $U(H_1) \le U(H_2)$. Axiom 1 is clearly satisfied. To see
that Axiom 2 is satisfied, note that $w_{\alpha H_1 + [1 - \alpha]H_2}(p) = \alpha w_{H_1}(p) + (1 - \alpha)w_{H_2}(p)$
for all $p$, so $W(\alpha H_1 + [1 - \alpha]H_2) = \alpha W(H_1) + (1 - \alpha)W(H_2)$. Now, use the
fact that $V(\alpha H_1 + [1 - \alpha]H_2) = \alpha V(H_1) + (1 - \alpha)V(H_2)$ as shown in
Example 3.106 to see that $U(\alpha H_1 + [1 - \alpha]H_2) = \alpha U(H_1) + (1 - \alpha)U(H_2)$. If
$H_n(r) \to H(r)$ for all $r$, then $\lim_{n \to \infty} w_{H_n}(p, r) = w_H(p, r)$ for all $p, r$. Let
$\{p_i\}_{i=1}^\infty$ be the prizes such that either $w_{H_n}(p_i) > 0$ for some $n$ or $w_H(p_i) > 0$.
Define $f(r) = 1 - \sum_{i=1}^\infty w_H(p_i, r)$ and $f_n(r) = 1 - \sum_{i=1}^\infty w_{H_n}(p_i, r)$. Then $W(H) =
\int f(r)\,dQ(r)$ and $W(H_n) = \int f_n(r)\,dQ(r)$. Since $0 \le f_n \le 1$ and $\lim_{n \to \infty} f_n(r) =
f(r)$ for all $r$, it follows from the dominated convergence theorem A.57 that
$\lim_{n \to \infty} W(H_n) = W(H)$. As shown in Example 3.106, $\lim_{n \to \infty} V(H_n) = V(H)$,
so $\lim_{n \to \infty} U(H_n) = U(H)$ and Axiom 3 holds. To see that Axiom 4 is satisfied,
let the preference between $H_1$ and $H_2$ be called-off when $B^C$ occurs, $B$ is
nonnull, and $H_1(r) = L_1$ and $H_2(r) = L_2$ for all $r \in B$. Then $W(H_1) = W(H_2) =
Q(B^C) - \sum_{\text{All } p} \int_{B^C} w_{H_1}(p, r)\,dQ(r)$, so $U(H_1) - U(H_2) = Q(B)[V(L_1) - V(L_2)]$. We see
that $H_1 \preceq H_2$ if and only if $L_1 \preceq L_2$. To see that Axiom 5 is violated, let
$H_1(r) = 1r$, that is, the NM-lottery that gives prize $p = r$ with probability 1. Let
$H_2 = 1$, that is, the constant horse lottery that gives the prize 1 with probability
1 for all $r$. It is easy to calculate $V(H_1) = 1/2$, $W(H_1) = 1$, $V(H_2) = 1$, and
$W(H_2) = 0$. So $U(H_1) = 3/2 > 1 = U(H_2)$. But $U(H_1(r)) = V(H_1(r)) = r$
for all $r$ and $U(H_2(r)) = V(1) = 1$ for all $r$. So $H_1(r) \preceq H_2(r)$ for all $r$, but
$H_2 \prec H_1$. Note that the ranking of horse lotteries by $U$ is not an expected utility
representation because of the added function $W$.

3.3.3 The Main Theorems 

Since the proofs of the major theorems are very long and not particularly 
straightforward, we state here, for the interested reader, the main results. 
The proofs will be given in Section 3.3.5. 

Theorem 3.108. Assume Axioms 1-5, and assume that preference is nondegenerate.
Then, there exists a bounded function $U : \mathcal{H} \to \mathbb{R}$ such that
$U(H(r))$ is a measurable function of $r$ for all $H \in \mathcal{H}$ and that satisfies

$$U(\alpha H_1 + [1-\alpha]H_2) = \alpha U(H_1) + (1-\alpha)U(H_2), \tag{3.109}$$

for all $\alpha \in [0, 1]$ and all $H_1, H_2$. Also, there exists a probability $Q$ on $(R, \mathcal{A}_1)$
such that for every $H_1, H_2 \in \mathcal{H}$, $H_1 \preceq H_2$ if and only if $\int U(H_1(r))\,dQ(r) \le
\int U(H_2(r))\,dQ(r)$. The probability $Q$ is unique, and $U$ is unique up to positive
affine transformation.

The function U in Theorem 3.108 is called a utility function. 
We also prove a theorem linking preference and conditional preference. 
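
In the finite case, the representation in Theorem 3.108 can be checked by direct arithmetic. The following is a minimal sketch and not part of the original development: the state names, the probability `Q`, the prize utilities `u`, and all helper names are hypothetical choices made only for illustration. It ranks two horse lotteries by expected utility and lets one confirm the mixture property (3.109) on an example.

```python
from fractions import Fraction as F

# Hypothetical finite setup: two states of Nature and three prizes.
Q = {"r1": F(1, 4), "r2": F(3, 4)}               # assumed probability on states
u = {"bad": F(0), "ok": F(1, 2), "good": F(1)}   # assumed utilities of prizes


def lottery_utility(L):
    """U(L) for an NM-lottery given as {prize: probability}."""
    return sum(p * u[prize] for prize, p in L.items())


def expected_utility(H):
    """Sum over states of U(H(r)) Q(r), for a horse lottery {state: NM-lottery}."""
    return sum(Q[r] * lottery_utility(H[r]) for r in Q)


def mix(a, H1, H2):
    """State-by-state mixture a*H1 + (1 - a)*H2."""
    out = {}
    for r in Q:
        prizes = set(H1[r]) | set(H2[r])
        out[r] = {p: a * H1[r].get(p, 0) + (1 - a) * H2[r].get(p, 0) for p in prizes}
    return out


def weakly_prefers(H1, H2):
    """True when H1 is ranked no higher than H2."""
    return expected_utility(H1) <= expected_utility(H2)


H1 = {"r1": {"good": F(1)}, "r2": {"bad": F(1)}}
H2 = {"r1": {"bad": F(1)}, "r2": {"ok": F(1)}}
```

Exact fractions are used so that the equality in (3.109) can be tested without rounding error.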

Theorem 3.110. Assume the conditions of Theorem 3.108. Let $(\mathcal{X}, \mathcal{B})$ be
a Borel space, and let $X : R \to \mathcal{X}$ be a random quantity. Let $Q$ be the
probability from Theorem 3.108, and let $\{Q(\cdot|x) : x \in \mathcal{X}\}$ be a regular
conditional distribution given $X$. Let $\{\preceq_x : x \in \mathcal{X}\}$ be a conditional
preference relation given $X$ which satisfies Axiom 6. Then there exists a set
$B$ such that $X^{-1}(B)$ is null and for all $x \notin B$, $H_1 \preceq_x H_2$ if and only if
$\int U(H_1(r))\,dQ(r|x) \le \int U(H_2(r))\,dQ(r|x)$.

In Theorem 3.110, if $Y : R \to \mathcal{Y}$ is a function of $X$, and $\{\preceq_y : y \in \mathcal{Y}\}$ is
a conditional preference relation given $Y$, then $\{\preceq_y : y \in \mathcal{Y}\}$ is consistent
with $\{\preceq_x : x \in \mathcal{X}\}$ because Theorem B.75 and Corollary B.74 say that
conditioning on $Y$ and then $X$ is the same as conditioning on $X$ alone.

3.3.4 Relation to Decision Theory 

Earlier in this chapter we set up decision theory using action spaces and 
loss functions. There is a natural connection between these concepts and 
the concepts introduced in this section. Let the states of Nature be possible
values of the parameter ($R = \Omega$), or of some future observable ($R = \mathcal{V}$),
or possibly data and parameter together ($R = \mathcal{X} \times \Omega$). Let the actions
(elements of $\aleph$) index functions from states to prizes (or, more generally,
from states to NM-lotteries). That is, for each $a \in \aleph$, there exists a horse
lottery $H_a : R \to \mathcal{L}$ such that $H_a(r)$ is the prize (NM-lottery) we get in
state of Nature $r$. For example, if $R = \Omega$, then we can consider
$L(\theta, a) = c - U(H_a(\theta))$ for arbitrary $c$. In this way bounded loss functions are like
the negatives of utility functions. Unbounded loss functions, however, do
not correspond to utilities that satisfy the axioms stated in this chapter.

Example 3.111. Suppose that $R = \Omega$ is the interval $[c_0, c_1]$. Let $P$, the set of
prizes, be a bounded interval of monetary units containing my current fortune $y$
and having half-width at least $(c_1 - c_0)^2$. Let $U(p) = p$ for each $p \in P$. If $\aleph = \Omega$,
then for each action $a \in \aleph$ we construct the horse lottery $H_a(\theta) = y - (a - \theta)^2$, a
function from $\Omega$ to $P$. Then $y - U(H_a(\theta)) = (a - \theta)^2$ is squared-error loss. Note
that we used the bounded intervals to ensure that utility is bounded.
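
The construction in Example 3.111 can be traced numerically. The sketch below is only an illustration under assumed values: the endpoints `c0` and `c1`, the fortune `y`, and the function names are hypothetical and do not come from the text. It builds the horse lottery $H_a(\theta) = y - (a - \theta)^2$, takes $U(p) = p$, and recovers squared-error loss as $c - U(H_a(\theta))$ with $c = y$.

```python
c0, c1 = 0.0, 2.0             # assumed endpoints of the parameter interval
y = 10.0                      # assumed current fortune
half_width = (c1 - c0) ** 2   # smallest half-width allowed in the example
prizes = (y - half_width, y + half_width)   # bounded prize interval around y


def horse_lottery(a, theta):
    """Prize received in state theta after choosing action a."""
    return y - (a - theta) ** 2


def utility(p):
    """U(p) = p, as in the example."""
    return p


def loss(theta, a, c=y):
    """L(theta, a) = c - U(H_a(theta)); with c = y this is squared error."""
    return c - utility(horse_lottery(a, theta))
```

Because both the action and the parameter lie in $[c_0, c_1]$, the prize never leaves the interval `prizes`, which is the role played by the half-width condition.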

Axiomatic developments like the one given in this section have two main 
consequences. The obvious one is that, as stated in the main theorems, 
if one satisfies the axioms, then the preferences have an expected utility 
representation. The contrapositive is also true. If preferences do not have 
an expected utility representation, then at least one axiom must be violated. 

Example 3.112 (Continuation of Example 3.72; see page 170). The minimax 
principle is often in conflict with Axiom 2. In this example, the minimax rule 
corresponds to the convex combination 

$$(0.3846, 0.3846) = 0.7692\,(0.5, 0.2) + 0.2308\,(0, 1)$$

in Figure 3.73. According to the minimax principle, $(0, 1) \prec (0.5, 0.2)$ because
$0.5 < 1$. If Axiom 2 were satisfied, then

$$\begin{aligned}
(0.3846, 0.3846) &= 0.7692\,(0.5, 0.2) + 0.2308\,(0, 1) \\
&\prec 0.7692\,(0.5, 0.2) + 0.2308\,(0.5, 0.2) \\
&= (0.5, 0.2).
\end{aligned}$$

But $(0.5, 0.2) \prec (0.3846, 0.3846)$ according to the minimax principle, because
$0.3846 < 0.5$. Hence, the minimax principle violates Axiom 2 in this example. 19
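
The arithmetic in Example 3.112 is easy to verify. The following sketch is an illustration only; the helper names are hypothetical, and the exact weight $10/13$ is our own reading of the rounded value 0.7692, obtained by equalizing the two coordinates of the mixture. Risk points are ranked by their largest coordinate, with a smaller maximum being better.

```python
from fractions import Fraction as F


def mix(a, x, y):
    """Coordinatewise mixture a*x + (1 - a)*y of two risk points."""
    return tuple(a * xi + (1 - a) * yi for xi, yi in zip(x, y))


def minimax_strictly_worse(x, y):
    """True when x is ranked strictly below y: x has the larger maximum risk."""
    return max(x) > max(y)


A = (F(1, 2), F(1, 5))   # risk point (0.5, 0.2)
B = (F(0), F(1))         # risk point (0, 1)
a = F(10, 13)            # assumed exact weight; 10/13 rounds to 0.7692
M = mix(a, A, B)         # both coordinates equal 5/13, which rounds to 0.3846

# Axiom 2 would require mix(a, A, B) to be strictly worse than mix(a, A, A) = A
# whenever B is strictly worse than A. Minimax ranks them the other way.
axiom2_requires = minimax_strictly_worse(B, A)   # premise: B is worse than A
minimax_says = minimax_strictly_worse(A, M)      # yet A is worse than the mixture
```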



3.3.5 Proofs of the Main Theorems* 

The theorems we prove pertain to the general case in which $(R, \mathcal{A}_1)$ is
an arbitrary Borel space and $P$ is an arbitrary set. Some readers may
wish to focus on the finite case in which $R = \{r_1, \ldots, r_n\}$, $\mathcal{A}_1 = 2^R$,
$P = \{p_1, \ldots, p_m\}$, and $\mathcal{A}_2 = 2^P$. In each of the results, we will point out
how the proofs can be simplified (usually by skipping major portions) in
the finite case. In the finite case, if $H(r_i) = L_i$ for $i = 1, \ldots, n$, then we
will denote $H = (L_1, \ldots, L_n)$. Since $\mathcal{A}_1 = 2^R$, it follows that $\mathcal{H}$ is convex
in the finite case. In the finite case Axiom 5 follows from Axioms 1-4. (See
Problem 38 on page 213.)

Until we prove that $\mathcal{H}$ is convex in general (Lemma 3.131), all of the
lemmas we prove will need to contain conditions like the one that appeared
in Axiom 2 concerning an arbitrary convex set $\mathcal{H}_0$ of horse lotteries. Once
we prove Lemma 3.131, then $\mathcal{H}_0$ can be taken equal to $\mathcal{H}$ in all of these
results. For this reason, in this section we will assume that $\mathcal{H}_0$ is a convex
set of horse lotteries. The theorems in this section will apply to every
such set $\mathcal{H}_0$. 20 Because some of the lemmas in this section are also useful
in Section 3.3.6, the hypotheses often include which axioms are assumed
explicitly.

Axiom 2 has an "if and only if" clause preceded by a quantification. The
implication in one direction is straightforward, namely that if $H_1 \preceq H_2$,
then $\alpha H_1 + (1-\alpha)H \preceq \alpha H_2 + (1-\alpha)H$ for all $H$ and all $\alpha \in (0, 1]$ (assuming
that the mixtures are horse lotteries). The other direction of implication
has more striking consequences. In words, if a horse lottery appears in two
mixtures on both sides of a preference, then the smaller amount can be
"removed" from each mixture without changing the preference.

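Lemma 3.113. Assume Axioms 1 and 2. Let $H_1, H_2, H \in \mathcal{H}_0$, and let
$\alpha, \beta \in (0, 1)$.

• Suppose that $\alpha H + (1-\alpha)H_1 \preceq \beta H + (1-\beta)H_2$. If $\alpha > \beta$, then

$$\frac{\alpha-\beta}{1-\beta}H + \frac{1-\alpha}{1-\beta}H_1 \preceq H_2.$$

If $\alpha < \beta$, then

$$H_1 \preceq \frac{\beta-\alpha}{1-\alpha}H + \frac{1-\beta}{1-\alpha}H_2.$$

• Suppose that $\alpha H + (1-\alpha)H_1 \prec \beta H + (1-\beta)H_2$. If $\alpha > \beta$, then

$$\frac{\alpha-\beta}{1-\beta}H + \frac{1-\alpha}{1-\beta}H_1 \prec H_2.$$

If $\alpha < \beta$, then

$$H_1 \prec \frac{\beta-\alpha}{1-\alpha}H + \frac{1-\beta}{1-\alpha}H_2.$$

19 See also Problem 25 on page 211.

*This section may be skipped without interrupting the flow of ideas.
20 In the finite case, $\mathcal{H}_0$ can be taken equal to $\mathcal{H}$, since $\mathcal{H}$ is known to be
convex in that case.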

Proof. The first two statements are proved in almost identical fashion.
We will prove only the first one. Axiom 2 says that for arbitrary $0 < \eta < 1$,

$$\eta H + (1-\eta)[\gamma H + (1-\gamma)H_1] \preceq \eta H + (1-\eta)H_2$$

implies that $\gamma H + (1-\gamma)H_1 \preceq H_2$. Let $\eta = \beta$ and $\gamma = (\alpha-\beta)/(1-\beta)$.

The last two statements are proved in almost identical fashion. We will
only prove the third one. Axiom 2 and the definition of $\prec$ say that for
arbitrary $0 < \eta < 1$,

$$\eta H + (1-\eta)[\gamma H + (1-\gamma)H_1] \prec \eta H + (1-\eta)H_2$$

implies that $\gamma H + (1-\gamma)H_1 \preceq H_2$. In fact, we can conclude $\gamma H + (1-\gamma)H_1 \prec H_2$
because, if not, then $H_2 \preceq \gamma H + (1-\gamma)H_1$ and Axiom 2 implies

$$\eta H + (1-\eta)H_2 \preceq \eta H + (1-\eta)[\gamma H + (1-\gamma)H_1],$$

a contradiction. Now, let $\eta = \beta$ and $\gamma = (\alpha-\beta)/(1-\beta)$. □

The next lemma says that a less preferred gamble can be substituted for
a more preferred one on the left side of any $\preceq$ relation. Similarly, a more
preferred gamble can be substituted for a less preferred one on the right
side of any $\preceq$ relation.

Lemma 3.114. Assume Axioms 1 and 2.

• Assume that $H_1, H_2 \in \mathcal{H}_0$ and that $H_1 \preceq H_2$. Then for every
$0 < \alpha < 1$, every $H_3 \in \mathcal{H}_0$, and every $H_4$, if $\alpha H_2 + (1-\alpha)H_3 \preceq H_4$,
then $\alpha H_1 + (1-\alpha)H_3 \preceq H_4$, and if $H_4 \preceq \alpha H_1 + (1-\alpha)H_3$, then
$H_4 \preceq \alpha H_2 + (1-\alpha)H_3$.

• Assume that $H_1, H_2 \in \mathcal{H}_0$ and that $H_1 \prec H_2$. Then for every
$0 < \alpha < 1$, every $H_3$, and every $H_4$, if $\alpha H_2 + (1-\alpha)H_3 \preceq H_4$, then
$\alpha H_1 + (1-\alpha)H_3 \prec H_4$, and if $H_4 \preceq \alpha H_1 + (1-\alpha)H_3$, then
$H_4 \prec \alpha H_2 + (1-\alpha)H_3$.

Proof. Suppose that $\alpha H_2 + (1-\alpha)H_3 \preceq H_4$ and $H_1 \preceq H_2$. Axiom 2 says
that

$$\alpha H_1 + (1-\alpha)H_3 \preceq \alpha H_2 + (1-\alpha)H_3.$$

It follows from the transitivity of $\preceq$ that $\alpha H_1 + (1-\alpha)H_3 \preceq H_4$. The
remaining cases are all similar. □

It follows easily that two indifferent horse lotteries can be substituted for
each other in all comparisons.



Corollary 3.115. Assume Axioms 1 and 2. Let $H_1, H_2 \in \mathcal{H}_0$, and assume
$H_1 \sim H_2$. Then for every $0 < \alpha < 1$, every $H_3 \in \mathcal{H}_0$, every $H_4$,
$\alpha H_1 + (1-\alpha)H_3 \preceq H_4$ if and only if $\alpha H_2 + (1-\alpha)H_3 \preceq H_4$, and
$H_4 \preceq \alpha H_1 + (1-\alpha)H_3$ if and only if $H_4 \preceq \alpha H_2 + (1-\alpha)H_3$.

The next lemma says that two different mixtures of the same two horse 
lotteries are ranked according to how much probability they give to the 
better of the two horse lotteries. 

Lemma 3.116. Assume Axioms 1 and 2. Let $H_1 \prec H_2 \in \mathcal{H}_0$. Suppose
that

$$H_3 \sim \alpha H_2 + (1-\alpha)H_1, \qquad H_4 \sim \beta H_2 + (1-\beta)H_1.$$

Then $\alpha \le \beta$ if and only if $H_3 \preceq H_4$.

Proof. Suppose that $H_3 \preceq H_4$ but $\beta < \alpha$. Then

$$\alpha H_2 + (1-\alpha)H_1 \preceq \beta H_2 + (1-\beta)H_1,$$

by Corollary 3.115. Since $\alpha > \beta$, use Lemma 3.113 to conclude that
$H_2 \preceq (\beta/\alpha)H_2 + ([\alpha-\beta]/\alpha)H_1$. Use Lemma 3.113 once again to conclude that
$H_2 \preceq H_1$, which is a contradiction.

Next, suppose that $\alpha \le \beta$ but $H_4 \prec H_3$. Then $\beta H_2 + (1-\beta)H_1 \prec
\alpha H_2 + (1-\alpha)H_1$, by Corollary 3.115. A contradiction follows just as before.
□

A useful consequence of the first three axioms is what is often called an
Archimedean condition. 21

Lemma 3.117 (Archimedean condition). Assume Axioms 1-3, and
assume that $H_1, H_2, H_3 \in \mathcal{H}_0$. If $H_1 \prec H_2 \preceq H_3$, then there exists a unique
$0 < \alpha \le 1$ so that

$$\alpha H_3 + (1-\alpha)H_1 \sim H_2.$$

Proof. Suppose that $H_1 \prec H_2 \preceq H_3$. Let $N_0 = \{\alpha : \alpha H_3 + (1-\alpha)H_1 \preceq H_2\}$,
and define $\beta_0 = \sup\{\alpha : \alpha \in N_0\}$. Since $N_0$ contains 0, it is nonempty
and $\beta_0$ is well defined. Define $H = \beta_0 H_3 + (1-\beta_0)H_1$. For $k = 1, 2, \ldots$, let
$\alpha_k \in N_0$ be such that $\lim_{k\to\infty} \alpha_k = \beta_0$ and define $G_k = \alpha_k H_3 + (1-\alpha_k)H_1$.
We have, for each $r$, $G_k(r) \to H(r)$ and, for each $k$, $G_k \preceq H_2$. By Axiom 3,
$H \preceq H_2$. Next, let $N_1 = \{\alpha : H_2 \preceq \alpha H_3 + (1-\alpha)H_1\}$, and define
$\beta_1 = \inf\{\alpha : \alpha \in N_1\}$. Since $N_1$ contains 1, it is nonempty and $\beta_1$ is well defined.
Define $G = \beta_1 H_3 + (1-\beta_1)H_1$. For $k = 1, 2, \ldots$, let $\gamma_k \in N_1$ be such that
$\lim_{k\to\infty} \gamma_k = \beta_1$ and define $H_k = \gamma_k H_3 + (1-\gamma_k)H_1$. We have, for each
$r$, $H_k(r) \to G(r)$ and, for each $k$, $H_2 \preceq H_k$. By Axiom 3, $H_2 \preceq G$. By
Lemma 3.116, $\beta_0 \le \beta_1$. If $\beta_0 < \beta < \beta_1$, then neither $H_2 \preceq \beta H_3 + (1-\beta)H_1$
nor $\beta H_3 + (1-\beta)H_1 \preceq H_2$, which contradicts Axiom 1. It follows that
$\beta_0 = \beta_1$ and $\alpha$ is the common value. Clearly, any value other than $\alpha$ is
either in $N_0$ or $N_1$ but not in both, so $\alpha$ is unique. □

21 In the proofs of results in the finite case, we do not explicitly use Axiom 3,
but rather we only use the Archimedean condition. This fact would allow us to
prove a converse to Lemma 3.117 in the finite case.

As an aside, some people prefer to take the Archimedean condition in
Lemma 3.117 as an axiom instead of Axiom 3. In the finite case, they are
equivalent.
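
When preferences happen to be given by a numerical score that is linear in mixtures, the weight $\alpha$ of Lemma 3.117 can be located by bisection, mirroring the sets $N_0$ and $N_1$ in the proof. The sketch below assumes such a score exists (the lemma itself does not); the function name and the tolerance are hypothetical.

```python
def archimedean_alpha(u1, u2, u3, tol=1e-12):
    """Locate alpha with alpha*u3 + (1 - alpha)*u1 equal to u2, by bisection.

    u1, u2, u3 are assumed scores of H_1, H_2, H_3 with u1 < u2 <= u3.
    The invariant is that lo lies in N_0 and hi bounds N_0 from above.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * u3 + (1 - mid) * u1 <= u2:
            lo = mid   # the mixture is not preferred to H_2, so mid is in N_0
        else:
            hi = mid   # the mixture is strictly preferred to H_2
    return (lo + hi) / 2
```

For instance, with scores $-2$, $1$, and $4$ the mixture weight is $1/2$, since $\tfrac12 \cdot 4 + \tfrac12 \cdot (-2) = 1$.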

Proposition 3.118. Assume Axioms 1 and 2 and assume that $P$ and $R$
are finite. If the Archimedean condition from Lemma 3.117 holds, then
continuity (Axiom 3) holds.

Lemma 3.119. Assume Axioms 1 and 2 and the Archimedean condition
of Lemma 3.117. There exists a function $U : \mathcal{H}_0 \to \mathbb{R}$ such that

$$U(H_1) \le U(H_2) \text{ if and only if } H_1 \preceq H_2. \tag{3.120}$$

Proof. If $H_1 \sim H_2$ for all $H_1, H_2 \in \mathcal{H}_0$, then just set $U(H) = 0$ for all $H$.
For the rest of the proof, assume that there exist $H_*, H^* \in \mathcal{H}_0$ such that
$H_* \prec H^*$. We will use the Archimedean condition in Lemma 3.117 to help
define $U$. For each $H \in \mathcal{H}_0$ such that $H_* \preceq H \preceq H^*$, define $U(H)$ equal
to the value of $\alpha$ such that $\alpha H^* + (1-\alpha)H_* \sim H$. Note that $U(H_*) = 0$
and $U(H^*) = 1$. 22 For each $H$ such that $H \prec H_*$, define $U(H)$ equal to
$-\alpha/(1-\alpha)$ for that $\alpha$ such that $\alpha H^* + (1-\alpha)H \sim H_*$. For each $H$ such
that $H^* \prec H$, define $U(H)$ equal to $1/\alpha$ for that value of $\alpha$ such that
$\alpha H + (1-\alpha)H_* \sim H^*$. Next, we prove that (3.120) is true.

There are six possible arrangements of $H_1$ and $H_2$ relative to $H_*$ and $H^*$
(ignoring the permutations of $H_1$ and $H_2$ themselves). Lemma 3.116 shows
that (3.120) is true if both $H_1$ and $H_2$ are between $H_*$ and $H^*$. 23 It is easy
to see that if only one of $H_1$ and $H_2$ is between $H_*$ and $H^*$ that (3.120)
is true, since one value of $U$ is between 0 and 1 and the other is not. Also,
(3.120) is true if $H_i \prec H_* \prec H^* \prec H_{3-i}$, for $i = 1$ or 2. The only cases
that remain are (i) that in which both $H_1$ and $H_2$ are preferred to $H^*$ and
(ii) that in which $H_*$ is preferred to both $H_1$ and $H_2$. For case (i), we have

$$H^* \sim \frac{1}{U(H_1)}H_1 + \frac{U(H_1)-1}{U(H_1)}H_*. \tag{3.121}$$

Lemma 3.116 and (3.122) say that $U(H_1) \le U(H_2)$ if and only if



22 In the finite case, one can prove that there exist two NM-lotteries $H_*$ and
$H^*$ such that $H_* \preceq H \preceq H^*$ for all $H \in \mathcal{H}$. (See Problem 33 on page 212.) For
this reason, one can skip to the next paragraph in the finite case.

23 The proof ends here in the finite case because we can choose $H_*$ and $H^*$ so
that $H_* \preceq H \preceq H^*$ for all $H \in \mathcal{H}$, by Problem 33 on page 212.

This and (3.121) are true if and only if $H_1 \preceq H_2$ by Lemma 3.113. Case
(ii) is similar. □
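
The three-part definition of $U$ in the proof of Lemma 3.119 can be checked against a known linear ranking. In the sketch below, which is an illustration and not part of the proof, each horse lottery is represented only by a hidden score `v` that is assumed to be linear in mixtures; `v_lo` and `v_hi` play the roles of $H_*$ and $H^*$. All names are hypothetical. In each branch the indifference weight $\alpha$ is solved from the stated indifference and then converted to $U$ exactly as in the proof.

```python
from fractions import Fraction as F


def constructed_utility(v, v_lo, v_hi):
    """U(H) as built in the proof, for a lottery with assumed linear score v."""
    if v_lo <= v <= v_hi:
        # alpha*H^* + (1 - alpha)*H_* ~ H, and U(H) is alpha itself
        return (v - v_lo) / (v_hi - v_lo)
    if v < v_lo:
        # alpha*H^* + (1 - alpha)*H ~ H_*, and U(H) = -alpha / (1 - alpha)
        alpha = (v_lo - v) / (v_hi - v)
        return -alpha / (1 - alpha)
    # v > v_hi: alpha*H + (1 - alpha)*H_* ~ H^*, and U(H) = 1 / alpha
    alpha = (v_hi - v_lo) / (v - v_lo)
    return 1 / alpha


def normalized_score(v, v_lo, v_hi):
    """The affine rescaling of v that sends v_lo to 0 and v_hi to 1."""
    return (v - v_lo) / (v_hi - v_lo)
```

Under the linearity assumption the two functions agree everywhere, which is why the constructed $U$ preserves the ranking in all six arrangements considered in the proof.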

Lemma 3.123. Assume Axioms 1 and 2 and the Archimedean condition
of Lemma 3.117. The function $U$ constructed in the proof of Lemma 3.119
satisfies (3.109) for all $H_1, H_2 \in \mathcal{H}_0$.

Proof. If $H_1 \sim H_2$ for all $H_1, H_2 \in \mathcal{H}_0$, the result is trivial. So, assume
that there exist $H_* \prec H^* \in \mathcal{H}_0$. There are 10 cases to handle, depending
both on how $H_1$ and $H_2$ compare to $H_*$ and $H^*$ and on the value of
$c = \alpha U(H_1) + (1-\alpha)U(H_2)$. Without loss of generality, assume $H_1 \preceq H_2$,
since they are arbitrary. 24

Case 1. $H_* \preceq H_1, H_2 \preceq H^*$. Since

$$\begin{aligned}
H_1 &\sim U(H_1)H^* + [1-U(H_1)]H_*, \\
H_2 &\sim U(H_2)H^* + [1-U(H_2)]H_*,
\end{aligned}$$

we can use Corollary 3.115 to conclude that

$$\alpha H_1 + (1-\alpha)H_2 \sim (\alpha U(H_1) + [1-\alpha]U(H_2))H^* + (1 - \alpha U(H_1) - [1-\alpha]U(H_2))H_*,$$

so that (3.109) holds.

Case 2. $H_1 \prec H_* \preceq H_2 \preceq H^*$ and $c \ge 0$. In this case, $c < 1$ is clear, and

$$H_2 \sim U(H_2)H^* + [1-U(H_2)]H_*, \tag{3.124}$$

$$H_* \sim \frac{-U(H_1)}{1-U(H_1)}H^* + \frac{1}{1-U(H_1)}H_1. \tag{3.125}$$

Mix $H_1$ with weight $\alpha$ with both sides of (3.124) to obtain

$$\begin{aligned}
\alpha H_1 + (1-\alpha)H_2 &\sim \alpha H_1 + (1-\alpha)U(H_2)H^* + (1-\alpha)[1-U(H_2)]H_* \\
&= \beta\left[\frac{-U(H_1)}{1-U(H_1)}H^* + \frac{1}{1-U(H_1)}H_1\right] \\
&\quad + (1-\beta)\left[\frac{\alpha U(H_1) + (1-\alpha)U(H_2)}{1-\alpha+\alpha U(H_1)}H^* + \frac{(1-\alpha)[1-U(H_2)]}{1-\alpha+\alpha U(H_1)}H_*\right],
\end{aligned}$$

where $\beta = \alpha[1-U(H_1)]$, which is less than 1 because $c \ge 0$. Use (3.125)
and Lemma 3.113 to see that the last expression is $\sim cH^* + (1-c)H_*$. This
implies that (3.109) is true.

Case 3. $H_1 \prec H_* \preceq H_2 \preceq H^*$ and $c < 0$. In this case, (3.124) and (3.125)
are still true. This time let $\beta = (\alpha - \alpha U(H_1))/(1 - \alpha U(H_1))$; mix the left-hand
side of (3.124) with weight $1-\beta$ with the right-hand side of (3.125)
with weight $\beta$, and mix the other sides also. The result is

$$\frac{1}{1-\alpha U(H_1)}[\alpha H_1 + (1-\alpha)H_2] + \frac{-\alpha U(H_1)}{1-\alpha U(H_1)}H^* \sim \frac{(1-\alpha)U(H_2)}{1-\alpha U(H_1)}H^* + \frac{1-c}{1-\alpha U(H_1)}H_*.$$

Now use Lemma 3.113 to remove the common $H^*$ from both sides (there
is more on the left than on the right) to get

$$\frac{1}{1-c}[\alpha H_1 + (1-\alpha)H_2] + \frac{-c}{1-c}H^* \sim H_*.$$

It follows that (3.109) is true.

24 Only case 1 is needed in the finite case.

Case 4. $H_1 \prec H_* \prec H^* \prec H_2$ and $c \in [0, 1]$. In this case,

$$H_* \sim \frac{-U(H_1)}{1-U(H_1)}H^* + \frac{1}{1-U(H_1)}H_1, \tag{3.126}$$

$$H^* \sim \frac{1}{U(H_2)}H_2 + \frac{U(H_2)-1}{U(H_2)}H_*. \tag{3.127}$$

Let $\beta = \alpha(1-U(H_1))/[\alpha(1-U(H_1)) + (1-\alpha)U(H_2)]$, and take the mixture
of (3.126) with weight $\beta$ and (3.127) with weight $1-\beta$. Writing
$D = \alpha(1-U(H_1)) + (1-\alpha)U(H_2)$ for the common denominator, the result is

$$\frac{\alpha(1-U(H_1))}{D}H_* + \frac{(1-\alpha)U(H_2)}{D}H^* \sim \frac{-\alpha U(H_1)}{D}H^* + \frac{\alpha}{D}H_1 + \frac{1-\alpha}{D}H_2 + \frac{(1-\alpha)(U(H_2)-1)}{D}H_*.$$

One can now use Lemma 3.113 to remove a common component consisting
of the two terms involving $H^*$ and $H_*$ on the right-hand side with weight

$$\frac{-\alpha U(H_1) + (1-\alpha)[U(H_2)-1]}{\alpha(1-U(H_1)) + (1-\alpha)U(H_2)}.$$

The result says that (3.109) is true.

Case 5. $H_1 \prec H_* \prec H^* \prec H_2$ and $c > 1$. In this case, (3.126) and (3.127)
are still true. This time, let $\beta = \alpha(1-U(H_1))/[\alpha(1-U(H_1)) + (1-\alpha)U(H_2)]$,
and mix (3.126) with weight $\beta$ with (3.127) with weight $1-\beta$ to get

$$\gamma H^* + (1-\gamma)L \sim \frac{\gamma}{c}[\alpha H_1 + (1-\alpha)H_2] + \frac{\gamma(c-1)}{c}H_* + (1-\gamma)L,$$

where

$$\gamma = \frac{c}{\alpha(1-U(H_1)) + (1-\alpha)U(H_2)}, \qquad
L = \frac{-\alpha U(H_1)}{\alpha(1-2U(H_1))}H^* + \frac{-\alpha[U(H_1)-1]}{\alpha(1-2U(H_1))}H_*.$$

Use Lemma 3.113 to remove the common component of $L$ from both sides
and the result is (3.109).

Case 6. $H_1, H_2 \prec H_*$. In this case, $c < 0$ is clear and we have (3.126)
together with

$$H_* \sim \frac{-U(H_2)}{1-U(H_2)}H^* + \frac{1}{1-U(H_2)}H_2. \tag{3.128}$$

Let $\beta = \alpha(1-U(H_1))/[\alpha(1-U(H_1)) + (1-\alpha)(1-U(H_2))]$, and mix (3.126)
with weight $\beta$ with (3.128) with weight $1-\beta$ to get

$$H_* \sim \frac{-c}{1-c}H^* + \frac{1}{1-c}[\alpha H_1 + (1-\alpha)H_2].$$

This implies that (3.109) is true.

Case 7. $H^* \prec H_1, H_2$. This is analogous to case 6.

Case 8. $H_1 \prec H_* \prec H^* \prec H_2$ and $c < 0$. This is analogous to case 5.

Case 9. $H_* \preceq H_1 \preceq H^* \prec H_2$ and $c > 1$. This is analogous to case 3.

Case 10. $H_* \preceq H_1 \preceq H^* \prec H_2$ and $c \le 1$. This is analogous to case 2. □

Now, we can prove that $U$ is bounded on $\mathcal{H}_0$. 25

Lemma 3.129. Assume Axioms 1-3. Then $U$ is bounded on $\mathcal{H}_0$.

Proof. If $H_1 \sim H_2$ for all $H_1, H_2 \in \mathcal{H}_0$, the result is trivial, so assume
that there exist $H_* \prec H^* \in \mathcal{H}_0$. Without loss of generality we can assume
that $U(H_*) = 0$ and $U(H^*) = 1$ (otherwise, just replace $U$ by
$[U - U(H_*)]/[U(H^*) - U(H_*)]$ and the preferences and boundedness are not
changed). Suppose, to the contrary, that $U(\cdot)$ is unbounded above. (A similar
construction works if $U$ is unbounded below.) Let $\{H_n\}_{n=1}^{\infty}$ be such that
$U(H_n) > n$. Let $H'_n = (1 - 1/n)H_* + (1/n)H_n$ for each $n$. Then $H^* \preceq H'_n$
for all $n$ because $U(H^*) = 1$ and $U(H'_n) > 1$ for all $n$, but $H'_n \to H_*$ and
$H_* \prec H^*$. This contradicts Axiom 3. □

If $\mathcal{H}_0 \subseteq \mathcal{H}$ contains $H_*$ and $H^*$ with $H_* \prec H^*$, define 26

$$\beta_1(\mathcal{H}_0) = \sup_{H \in \mathcal{H}_0} U(H), \qquad \beta_0(\mathcal{H}_0) = \inf_{H \in \mathcal{H}_0} U(H).$$

The following lemma is useful in allowing us to find NM-lotteries with
arbitrary utilities. 27

25 The conclusions of Lemma 3.129 are obvious in the finite case.

26 In the finite case, we can arrange for $\beta_1(\mathcal{H}) = 1$ and $\beta_0(\mathcal{H}) = 0$.

27 In the finite case, Lemma 3.130 follows trivially from the fact that there exist
NM-lotteries $L^*$ and $L_*$ that achieve the maximum (1) and minimum (0) values
of $U$ and the fact that $U(\alpha L^* + (1-\alpha)L_*) = \alpha$ for every $\alpha \in [0, 1]$.

Lemma 3.130. Assume Axioms 1-3. For each $\beta \in (\beta_0(\mathcal{H}_0), \beta_1(\mathcal{H}_0))$, there
exists an NM-lottery $L$ with $U(L) = \beta$.

Proof. If $H_1 \sim H_2$ for all $H_1, H_2 \in \mathcal{H}_0$, then $\beta_0(\mathcal{H}_0) = \beta_1(\mathcal{H}_0)$, and the
result is vacuous, so assume that there exist $H_* \prec H^* \in \mathcal{H}_0$ with $U(H_*) = 0$
and $U(H^*) = 1$. Assume, to the contrary, that $\alpha_0 = \inf_{L \in \mathcal{L}} U(L) >
\beta_0(\mathcal{H}_0)$. 28 We know that $\alpha_0 \le 0$, since $U(H_*) = 0$. Let $H$ be a horse lottery
such that $U(H) < \alpha_0$, which must exist since $\beta_0(\mathcal{H}_0)$ is the infimum of all
utilities of horse lotteries. Let $\epsilon = \alpha_0 - U(H)$ so that $U(H) = \alpha_0 - \epsilon < 0$.
Let $\alpha = \alpha_0/[2(\alpha_0 - \epsilon)]$, which is easily seen to be between 0 and 1/2.
Let $L$ be an NM-lottery such that $U(L) = \alpha_0(1/2 + \alpha)/2$, which is in
the open interval $(\alpha_0/2, \alpha\alpha_0)$. Define $H' = \alpha H + (1-\alpha)H_*$. This means
that $U(H') = \alpha_0/2 < U(L)$, hence $H' \prec L$. But $H'(r) = \alpha H(r) + (1-\alpha)H_*$.
We have assumed that $U(H(r)) \ge \alpha_0$ for all $r$, since $H(r) \in \mathcal{L}$. So
$U(H'(r)) \ge \alpha\alpha_0 > U(L)$. This implies that $L \preceq H'(r)$ for all $r$. Axiom 5
implies $L \preceq H'$, a contradiction. A similar contradiction holds if we assume
that $\sup_{L} U(L) < \beta_1(\mathcal{H}_0)$. □

We are now in position to prove that $\mathcal{H}$ is itself convex.

Lemma 3.131. 29 Assume Axioms 1-3. Let $\mathcal{H}_0$ be the set of all constant
horse lotteries.

• For each horse lottery $H \in \mathcal{H}$, the function $g : R \to \mathbb{R}$ defined by
$g(r) = U(H(r))$ is measurable.

• If $H_1, H_2 \in \mathcal{H}$ and $0 \le \alpha \le 1$, then $\alpha H_1 + (1-\alpha)H_2 \in \mathcal{H}$.

Proof. For the first part, let $H \in \mathcal{H}$ and let $g(r) = U(H(r))$. We know
that $\beta_0(\mathcal{H}_0) \le g(r) \le \beta_1(\mathcal{H}_0)$ for all $r$. To prove that $g$ is measurable, we
need to show that for every $c \in (\beta_0(\mathcal{H}_0), \beta_1(\mathcal{H}_0))$, $\{r : g(r) \le c\} \in \mathcal{A}_1$. For
each such $c$, let $L_c$ be an NM-lottery with $U(L_c) = c$. Then

$$\{r : g(r) \le c\} = \{r : U(H(r)) \le U(L_c)\} = \{r : H(r) \preceq L_c\} \in \mathcal{A}_1,$$

where the second equality follows from Lemma 3.119, and the inclusion
follows from the definition of a horse lottery.

For the second part, let $H_1, H_2 \in \mathcal{H}$ and $0 \le \alpha \le 1$. We need to prove
that for all $L \in \mathcal{L}$, $\{r : \alpha H_1(r) + (1-\alpha)H_2(r) \preceq L\} \in \mathcal{A}_1$. Let $L \in \mathcal{L}$.
Lemma 3.123 says that

$$\{r : \alpha H_1(r) + (1-\alpha)H_2(r) \preceq L\} = \{r : \alpha U(H_1(r)) + (1-\alpha)U(H_2(r)) \le U(L)\}. \tag{3.132}$$

But the first part of this lemma shows that both $U(H_1(\cdot))$ and $U(H_2(\cdot))$
are measurable functions. Hence the convex combination is measurable. It
follows that the set on the right-hand side of (3.132) is in $\mathcal{A}_1$. □

From now on, so long as we assume Axioms 1-3, we can assume that $\mathcal{H}$
is closed under convex combination.

28 This can only happen if $\beta_0(\mathcal{H}_0) < 0$. If $\beta_0(\mathcal{H}_0) = 0$, then $\alpha_0 = \beta_0(\mathcal{H}_0)$ must
occur.

29 The conclusions of Lemma 3.131 are already known in the finite case.

Lemma 3.133. Assume Axiom 5 and that preference is nondegenerate.
Then there exist two NM-lotteries $L_*$ and $L^*$ such that $L_* \prec L^*$.

Proof. Since the preference is nondegenerate, there exist horse lotteries
$H_* \prec H^*$. If, to the contrary, $H_*(r) \sim H^*(r)$ for all $r$, then Axiom 5 says
$H^* \preceq H_*$, a contradiction. □

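Lemma 3.134. Assume Axioms 1-3. Let $H_*$ and $H^*$ be horse lotteries
such that $U(H_*) = 0$ and $U(H^*) = 1$. For each $B \in \mathcal{A}_1$, define $H_B$ by

$$H_B(r) = \begin{cases} H^*(r) & \text{if } r \in B, \\ H_*(r) & \text{if } r \notin B. \end{cases} \tag{3.135}$$

Let $Q(B) = U(H_B)$. Suppose that $H_* \preceq H_B$ for all $B$. Then $Q$ is a probability.

Proof. It is easy to see that $H_B$ is a horse lottery. It follows from $H_* \preceq H_B$
that $Q(B) \ge 0$. It is easy to see that $Q(\emptyset) = 0$ and $Q(R) = 1$. If $C$ and
$D$ are disjoint, define $H = \frac{1}{2}H_C + \frac{1}{2}H_D$, which equals $\frac{1}{2}H_{C \cup D} + \frac{1}{2}H_*$.
According to (3.109),

$$\begin{aligned}
\tfrac{1}{2}[Q(C) + Q(D)] &= \tfrac{1}{2}[U(H_C) + U(H_D)] = U\left(\tfrac{1}{2}H_C + \tfrac{1}{2}H_D\right) \\
&= U\left(\tfrac{1}{2}H_{C \cup D} + \tfrac{1}{2}H_*\right) = \tfrac{1}{2}[U(H_{C \cup D}) + U(H_*)] \\
&= \tfrac{1}{2}Q(C \cup D),
\end{aligned}$$

from which it follows that $Q(C \cup D) = Q(C) + Q(D)$. 30 Next, let $\{A_n\}_{n=1}^{\infty}$
be mutually disjoint subsets of $R$, and let

$$A = \bigcup_{i=1}^{\infty} A_i, \qquad B_n = \bigcup_{i=1}^{n} A_i.$$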

For every n, we have 



30 The proof ends here in the finite case, since there do not exist infinitely many
disjoint subsets of $R$. Note that in the finite case Axiom 3 is not used, only the
Archimedean condition of Lemma 3.117 is used.


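hence, we can choose $\alpha_n \in [0, 1]$ so that

$$H_n = (1-\alpha_n)\left[\tfrac{1}{2}H_{B_n} + \tfrac{1}{2}H_*\right] + \alpha_n H^* \sim \tfrac{1}{2}H_* + \tfrac{1}{2}H_A.$$

Since we just showed that $Q$ is finitely additive, we have

$$\begin{aligned}
\tfrac{1}{2}Q(A) &= \tfrac{1}{2}U(H_A) = \tfrac{1}{2}U(H_*) + \tfrac{1}{2}U(H_A) = U(H_n) \\
&= \tfrac{1}{2}(1-\alpha_n)U(H_{B_n}) + \tfrac{1}{2}(1-\alpha_n)U(H_*) + \alpha_n U(H^*) \\
&= \tfrac{1}{2}(1-\alpha_n)\sum_{i=1}^{n} Q(A_i) + \alpha_n.
\end{aligned}$$

It follows that for all $n$, $Q(A) = (1-\alpha_n)\sum_{i=1}^{n} Q(A_i) + 2\alpha_n$. If we can
show that $\lim_{n\to\infty} \alpha_n = 0$, then we have that $Q$ is countably additive. Let
$\{\alpha_{n_k}\}_{k=1}^{\infty}$ be a convergent subsequence of $\{\alpha_n\}_{n=1}^{\infty}$ with limit $\alpha$. Then

$$H_{n_k}(r) \to (1-\alpha)\left[\tfrac{1}{2}H_A(r) + \tfrac{1}{2}H_*(r)\right] + \alpha H^*(r) = H(r).$$

It follows from Axiom 3 (with $H' = H'' = \frac{1}{2}H_* + \frac{1}{2}H_A$) that $H \sim \frac{1}{2}H_* + \frac{1}{2}H_A$.
But $H = (1-\alpha)[\frac{1}{2}H_* + \frac{1}{2}H_A] + \alpha H^*$. It follows from Axiom 2 that
either $\alpha = 0$ or $H^* \sim \frac{1}{2}H_* + \frac{1}{2}H_A$. Since this latter is clearly false, it must
be that $\alpha = 0$ and $\alpha_n \to 0$. □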
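
In the finite case the definition $Q(B) = U(H_B)$ can be carried out directly. The sketch below is an illustration under assumptions: the ranking is taken to be an expected utility with hidden state weights `weights`, and each horse lottery is summarized by its state-by-state utilities; neither device appears in the lemma, and all names are hypothetical. It shows that $Q$ recovers the hidden weights and is finitely additive.

```python
from fractions import Fraction as F

states = ("r1", "r2", "r3")
weights = {"r1": F(1, 2), "r2": F(1, 3), "r3": F(1, 6)}   # assumed, hidden from Q


def U(H):
    """Assumed ranking function; H maps each state r to the utility U(H(r))."""
    return sum(weights[r] * H[r] for r in states)


def H_B(B):
    """(3.135) with U(H^*(r)) = 1 and U(H_*(r)) = 0: best on B, worst elsewhere."""
    return {r: F(1) if r in B else F(0) for r in states}


def Q(B):
    """Q(B) = U(H_B), as in Lemma 3.134."""
    return U(H_B(frozenset(B)))
```

For disjoint sets such as `{"r1"}` and `{"r3"}`, `Q` of the union equals the sum of the two values, which is the finite additivity established in the first half of the proof.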

Lemma 3.136. Assume the conditions of Theorem 3.108. In Lemma 3.134,
let $H_* = L_*$ and $H^* = L^*$ from Lemma 3.133. Then $H_* \preceq H_B$ for all
$B \in \mathcal{A}_1$. For all $B \in \mathcal{A}_1$, $Q(B) = 0$ if and only if $B$ is null.

Proof. The fact that $Q(B) = 0$ if and only if $B$ is null follows easily from
Axiom 4 and is left to the reader (as Problem 37 on page 213). By Axiom 5,
$L_* \preceq H_B \preceq L^*$. 31 □

Lemma 3.137. Assume the conditions of Theorem 3.108. Let $H$ be a horse
lottery that takes on only finitely many different NM-lotteries. Then

$$U(H) = \int U(H(r))\,dQ(r).$$

Proof. Let $L'_1, \ldots, L'_n$ be the different NM-lotteries that $H$ takes on. Let

$$\begin{aligned}
b_1 &= \max\{1, U(L'_1), \ldots, U(L'_n)\}, \\
b_0 &= \min\{0, U(L'_1), \ldots, U(L'_n)\}.
\end{aligned}$$



31 In the finite case, the fact that $L_* \preceq H_B \preceq L^*$ was already known without
appeal to Axiom 5.


Define $c_1 = [b_1(1 - b_0)]^{-1}$ and $c_2 = -b_0/(1 - b_0)$. Clearly, $c_1 > 0$, $c_2 \ge 0$,
and $c_1 + c_2 \le 1$. Let $H'' = c_1 H + c_2 L^* + (1 - c_1 - c_2)L_*$. 32 Then

$$\begin{aligned}
U(H'') &= c_1 U(H) + c_2, \\
U(H''(r)) &= c_1 U(H(r)) + c_2, \text{ for all } r.
\end{aligned} \tag{3.138}$$

Also, $0 \le U(H''(r)) \le 1$ follows from (3.109) and simple algebra. Since
$c_1 \ne 0$ and

$$\int U(H''(r))\,dQ(r) = c_1 \int U(H(r))\,dQ(r) + c_2,$$

it is sufficient to prove the result for $H''$. Since $H''(r)$ is the same mixture
of $H(r)$ and $L_*$ and $L^*$ for all $r$, $H''$ takes on only finitely many different
NM-lotteries also. Let $H''(r) = L_i$ for $r \in B_i$ for $i = 1, \ldots, n$, where the
$B_i \in \mathcal{A}_1$ form a finite partition of $R$. For each $i$, define $H_i$ by

$$H_i(r) = \begin{cases} L_i & \text{if } r \in B_i, \\ L_* & \text{if not.} \end{cases}$$

It is easy to see that $H_i$ is a horse lottery and that

$$\frac{1}{n}H'' + \frac{n-1}{n}L_* = \frac{1}{n}H_1 + \cdots + \frac{1}{n}H_n.$$

Hence, $U(H'') = \sum_{i=1}^{n} U(H_i)$. Since

$$\int U(H''(r))\,dQ(r) = \sum_{i=1}^{n} U(L_i)Q(B_i),$$

we complete the proof by showing that $U(H_i) = U(L_i)Q(B_i)$ for each $i$.

Since $0 \le U(L_i) \le 1$, we know that $L_i \sim U(L_i)L^* + [1 - U(L_i)]L_*$. By
Axiom 4, we can substitute the right-hand side of this expression for $L_i$ in
the definition of $H_i$ and conclude that $H_i \sim H'_i$, where

$$H'_i(r) = \begin{cases} U(L_i)L^* + [1 - U(L_i)]L_* & \text{if } r \in B_i, \\ L_* & \text{if not.} \end{cases}$$

Hence, $U(H_i) = U(H'_i)$. For each $i$, define the horse lottery $H_{B_i}$ as in
Lemma 3.134. So, $H'_i = U(L_i)H_{B_i} + [1 - U(L_i)]L_*$. It follows that $U(H'_i) =
U(L_i)Q(B_i)$, as desired. □
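
The "simple algebra" behind the rescaling constants $c_1$ and $c_2$ can be checked on examples. The sketch below only illustrates that step; the function name and the sample utilities are hypothetical. It computes $b_0$, $b_1$, $c_1$, and $c_2$ from a list of utilities and lets one confirm that $c_1 u + c_2$ lies in $[0, 1]$ for every listed utility $u$.

```python
from fractions import Fraction as F


def rescale_constants(utilities):
    """Return (c1, c2) built from b1 = max{1, u_i} and b0 = min{0, u_i}."""
    b1 = max([F(1)] + list(utilities))
    b0 = min([F(0)] + list(utilities))
    c1 = 1 / (b1 * (1 - b0))
    c2 = -b0 / (1 - b0)
    return c1, c2


sample = [F(-1), F(1, 2), F(3)]      # assumed utilities U(L'_1), ..., U(L'_n)
c1, c2 = rescale_constants(sample)
rescaled = [c1 * v + c2 for v in sample]
```

When every utility already lies in $[0, 1]$, the function returns $c_1 = 1$ and $c_2 = 0$, in agreement with the remark about the finite case.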

Lemma 3.139. 33 Assume the conditions of Theorem 3.108. Let $H$ be an
arbitrary horse lottery. Then $U(H) = \int U(H(r))\,dQ(r)$.



32 In the finite case, $c_1 = 1$, $c_2 = 0$, and $H'' = H$.
33 This lemma is not needed in the finite case.


Proof. First, suppose that $U(H(r)) \ge 0$ for all $r$. Let $H'' = \frac{1}{2}H + \frac{1}{2}L^*$.
Since $U(H'') = \frac{1}{2}U(H) + \frac{1}{2}$ and $\int U(H''(r))\,dQ(r) = \frac{1}{2}\int U(H(r))\,dQ(r) + \frac{1}{2}$,
it suffices to prove the result for $H''$. Let $b_1 = \sup_r U(H''(r))$. It follows
from Lemma 3.130 that for all $x < b_1$ there exists an NM-lottery $L$ with
$U(L) = x$.

For each $n$ and $k = 0, 1, \ldots, n2^n$, define

$$B_{n,k} = \left\{r : \frac{k-1}{2^n} < U(H''(r)) \le \frac{k}{2^n}\right\}.$$

Define the horse lotteries $H_n$ for each $n$ by

$$H_n(r) = L_{n,k} \text{ for all } r \in B_{n,k},\ k = 0, 1, \ldots, n2^n,$$

where $L_{n,k}$ are chosen (see Lemma 3.130) so that $U(L_{n,k}) = \min\{b_1, (k-1)/2^n\}$
for $k \ge 1$ and $L_{n,0} = L_*$. It follows from Axiom 5 that $L_* \preceq L_{n,k} \preceq
H''(r)$ for all $n, k$ and all $r \in B_{n,k}$, hence $0 \le U(H_n(r)) \le U(H''(r))$ for
all $r, n$. Since $U(H_n(r))$ converges to $U(H''(r))$ for all $r$, the monotone
convergence theorem A.52 implies

$$\lim_{n\to\infty} \int U(H_n(r))\,dQ(r) = \int U(H''(r))\,dQ(r). \tag{3.140}$$

Lemma 3.137 says that the integrals on the left-hand side of (3.140) are
$U(H_n)$. Since Axiom 5 says that $U(H_n) \le U(H'')$ for all $n$,

$$\int U(H''(r))\,dQ(r) \le U(H'').$$

Since $U$ is bounded above, we can choose $M_{n,k}$ to be NM-lotteries such
that $U(M_{n,k}) = \min\{b_1, k/2^n\}$. Just as above, let $H_n(r) = M_{n,k}$ for $r \in
B_{n,k}$, so that $U(H''(r)) \le U(H_n(r)) \le b_1$ for all $r, n$. The dominated
convergence theorem A.57 says that

$$U(H'') \le \lim_{n\to\infty} \int U(H_n(r))\,dQ(r) = \int U(H''(r))\,dQ(r).$$

It follows that $\int U(H''(r))\,dQ(r) \ge U(H'')$, and the result is proven when
$U(H(r)) \ge 0$ for all $r$.

A similar argument works if $U(H(r)) < 0$ for all $r$. For arbitrary $H$, let
$H^+(r) = H(r)$ if $U(H(r)) \ge 0$ and $H^+(r) = L_*$ otherwise. Let $H^-(r) =
H(r)$ if $U(H(r)) < 0$ and $H^-(r) = L_*$ otherwise. Then $\frac{1}{2}H^+ + \frac{1}{2}H^- =
\frac{1}{2}H + \frac{1}{2}L_*$, and $U(H) = U(H^+) + U(H^-)$. The result now follows. □

The last two lemmas prove the essential uniqueness of U and the unique- 
ness of Q. 

Lemma 3.141. Assume the conditions of Theorem 3.108. The utility $U$
from Lemma 3.119 is unique up to positive affine transformation.

Proof. Let $U'_1$ and $U'_2$ be two utilities. If preference is degenerate, then
both $U'_1$ and $U'_2$ are constant and the result is trivial. So, suppose that there
exist $H_*$ and $H^*$ with $H_* \prec H^*$. For $i = 1, 2$, define $U_i(H) = [U'_i(H) -
U'_i(H_*)]/[U'_i(H^*) - U'_i(H_*)]$. This makes $U_i(H_*) = 0$ and $U_i(H^*) = 1$ for
$i = 1, 2$ without affecting the other properties of each $U_i$. Now, suppose
that there exists $H$ such that $U_1(H) \ne U_2(H)$. Without loss of generality,
assume $U_1(H) < U_2(H)$. There are five cases to consider. 34

Case 1. $0 \le U_1(H) < U_2(H) \le 1$. Let $U_1(H) < \alpha < U_2(H)$ and let
$H' = \alpha H^* + (1-\alpha)H_*$. Then $U_i(H') = \alpha$ for $i = 1, 2$. Now $U_1(H) < U_1(H')$,
meaning $H \prec H'$, and $U_2(H') < U_2(H)$, meaning $H' \prec H$, a contradiction.

Case 2. $U_1(H) < U_2(H) \le 0$. Let $U_1(H) < c < U_2(H)$, and define
$H' = [c/(c-1)]H^* + [(-1)/(c-1)]H$. Then $U_1(H') < 0$, so $H' \prec H_*$, and
$U_2(H') > 0$, so $H_* \prec H'$, a contradiction.

Case 3. $1 \le U_1(H) < U_2(H)$. Let $U_1(H) < c < U_2(H)$, and define
$H' = (1/c)H + [(c-1)/c]H_*$. Then $U_1(H') < 1$, so $H' \prec H^*$, and $U_2(H') > 1$,
so $H^* \prec H'$, a contradiction.

Case 4. $U_1(H) < 0 < U_2(H)$. Then $H \prec H_* \prec H$, a contradiction.

Case 5. $U_1(H) < 1 < U_2(H)$. Then $H \prec H^* \prec H$, a contradiction.

Finally, note that if $U_1 = U_2$, then $U'_1$ and $U'_2$ are positive affine
transformations of each other. □
Lemma 3.142. Under the conditions of Theorem 3.108, the probability Q 
is unique. 

Proof. Lemma 3.141 shows that the utility is unique up to positive affine
transformation, so suppose that there are two different probabilities $Q_1$
and $Q_2$ such that for both $i = 1$ and $i = 2$, $H_1 \preceq H_2$ if and only if
$\int U(H_1(r))\,dQ_i(r) \le \int U(H_2(r))\,dQ_i(r)$. Pick two NM-lotteries $L_*$ and
$L^*$ such that $L_* \prec L^*$. Let $B$ be an arbitrary subset of $R$ and define
$H_B$ as in (3.135). It follows that $U(H_B) = Q_i(B)$ for $i = 1, 2$, so that
$Q_1(B) = Q_2(B)$. □

Since NM-lotteries are concentrated on only finitely many prizes, the
following is a simple consequence of (3.109).

Corollary 3.143. Under the conditions of Theorem 3.108, if $L = \alpha_1 p_1 +
\cdots + \alpha_k p_k$, then $U(L) = \sum_{i=1}^{k} \alpha_i U(p_i)$.

The proof of Theorem 3.110 requires a lemma first. 

Lemma 3.144. Under the conditions of Theorem 3.110, if $L_1, L_2$ are NM-lotteries
such that $L_1 \prec (\preceq) L_2$, then there exists a set $B$ such that $X^{-1}(B)$
is null and, for all $x \notin B$, $L_1 \prec_x (\preceq_x) L_2$.

Proof. Let L x ^ L 2 , and let B = {x : L 2 -« x Li}. Define ifi(r) = L 2 for 
all r e X-\B) and H x (r) = L x for all r £ X" 1 ^). Then Axiom 6 says 



Only case 1 is needed in the finite case. 



3.3. Axiomatic Derivation of Decision Theory 






that $H_1 \prec L_1$, but Axiom 4 says that $L_1 \preceq H_1$ if $X^{-1}(B)$ is nonnull. It
follows that $X^{-1}(B)$ must be null. A similar proof works if $L_1 \prec L_2$. □
Before giving the proof of Theorem 3.110, we give a brief outline. We use 
Theorem 3.108 to represent conditional preference by expected utility sepa- 
rately for each value of x. We then use Lemma 3.144 to show that the utility 
function in the conditional preference representation must equal the utility 
function for unconditional preference except on a null set. We prove that 
the probability measure for the conditional preference representation must 
equal conditional probability calculated from the unconditional preference 
by showing that if it were not, we could construct a pair of horse lotteries 
that are conditionally ordered one way for all x, but that are marginally 
ordered the opposite way, contradicting Axiom 6. 

Proof of Theorem 3.110. Let $L_*$ and $L^*$ be as in Lemma 3.133. Accord-
ing to Theorem 3.108, since $\preceq_x$ satisfies Axioms 1-5 for each $x$, there is, for
each $x$, a probability $P_x$ on $(R,\mathcal{A}_1)$ and a utility $U_x$ such that $H_1 \preceq_x H_2$
if and only if $\int U_x(H_1(r))\,dP_x(r) \le \int U_x(H_2(r))\,dP_x(r)$. Let $Q_X$ denote the
distribution of $X$ induced from $Q$. Lemma 3.144 says that each pair of NM-
lotteries is ranked the same by $\preceq_x$ except possibly for $x$ in a set with null
inverse image. Since Lemma 3.136 says that a set $C$ is null if and only if
$Q(C) = 0$, we can assume that there is $B_0$ such that $Q(X^{-1}(B_0)) = 0$
and $U_x(L_*) < U_x(L^*)$, for all $x \notin B_0$. We can certainly assume that
$U_x(L_*) = 0 = U(L_*)$ and $U_x(L^*) = 1 = U(L^*)$ for all $x \notin B_0$. Let $B_-$
be the set of all $x$ such that there exists $L_x$ with $U_x(L_x) < U(L_x)$, and
let $B_+$ be the set of all $x$ such that there exists $L_x$ with $U_x(L_x) > U(L_x)$.
We will show next that for each $x \in B_+ \cup B_-$, we could choose $L_x$ so that
$U(L_x) = \frac12$. For each $x \in B_+ \cup B_-$, let

$$L'_x = \frac{1}{b_1(1-b_0)}L_x + \frac{b_1-1}{b_1}L_* - \frac{b_0}{1-b_0}L^*,$$

where $b_0 = \min\{0, U(L_x)\}$ and $b_1 = \max\{1, U(L_x)\}$. Then $U_x(L'_x) \ne
U(L'_x)$, but now $0 \le U(L'_x) \le 1$. By mixing $L'_x$ with either $L_*$ or $L^*$ to
create $L_x$, we can have $U(L_x) = \frac12$ and either $U_x(L_x) > \frac12$ or $U_x(L_x) < \frac12$.
That is, we can assume that $U(L_x) = \frac12$. Let $L^{1/2} = \frac12 L_* + \frac12 L^*$. Define
horse lotteries $H_+$ and $H_-$ by

$$H_+(r) = \begin{cases} L_x & \text{if } X(r) = x \text{ and } x \in B_+, \\ L^{1/2} & \text{otherwise,} \end{cases}$$

$$H_-(r) = \begin{cases} L_x & \text{if } X(r) = x \text{ and } x \in B_-, \\ L^{1/2} & \text{otherwise.} \end{cases}$$

Since $\{r : H_+(r) \preceq L\}$ is either $R$ or $\emptyset$ depending on whether or not
$U(L) \ge \frac12$, $H_+$ is a horse lottery, and similarly for $H_-$. By construction
$H_- \preceq_x L^{1/2}$ for all $x$ and $H_- \prec_x L^{1/2}$ for all $x \in B_-$. Also, by the
measurability condition in the definition of conditional preference, $B_- =
\{x : L^{1/2} \preceq_x H_-\}^c$, so $B_- \in \mathcal{B}$. It follows from Axiom 6 that $X^{-1}(B_-)$

204 Chapter 3. Decision Theory 



is null. By a similar argument, we can show that $X^{-1}(B_+)$ is null. Let $B'$
equal $B_0 \cup B_- \cup B_+$. Then $X^{-1}(B')$ is null and, for all $x \notin B'$, $U_x(L) = U(L)$
for all $L \in \mathcal{L}$.

Next,$^{35}$ we prove that $P_x(A)$ is a measurable function of $x$ for all $A \in \mathcal{A}_1$.
For each $A \in \mathcal{A}_1$, let $H_A(r) = L^*$ if $r \in A$ and $H_A(r) = L_*$ if $r \notin A$. For
each $c \in [0,1]$,

$$\{x : P_x(A) \le c\} = \{x : H_A \preceq_x cL^* + (1-c)L_*\} \in \mathcal{B}$$

follows from the measurability assumption on conditional preference.

Finally, we prove that $Q(\cdot|x)$ and $P_x(\cdot)$ agree almost surely. Let $D =
\{x : P_x(\cdot) \ne Q(\cdot|x)\}$ and $B = B' \cup D$. If we can prove that $D \in \mathcal{B}$ and
$Q_X(D) = 0$, the proof is complete.$^{36}$ Since $(R,\mathcal{A}_1)$ is a Borel space, $\mathcal{A}_1$ is
countably generated (Proposition B.43). Let $\{A_n\}_{n=1}^\infty$ generate $\mathcal{A}_1$. Then

$$D = \bigcup_{n=1}^\infty \left(\{x : P_x(A_n) < Q(A_n|x)\} \cup \{x : P_x(A_n) > Q(A_n|x)\}\right). \qquad (3.145)$$

Since both $P_x(A_n)$ and $Q(A_n|x)$ are measurable functions, each of the sets
in the union is in $\mathcal{B}$, so $D \in \mathcal{B}$.

If $Q_X(D) > 0$, then one of the sets in the union (3.145) must have strictly
positive $Q_X$ measure. Let $D' = \{x : P_x(A_1) < Q(A_1|x)\}$, and suppose
that $Q_X(D') > 0$. For each rational $q \in [0,1]$, let $D_q = \{x : P_x(A_1) <
q < Q(A_1|x)\}$. Then $D'$ is the union of all the $D_q$. Since this union is
countable, there exists $q$ such that $Q_X(D_q) > 0$. Define $H_1(r) = L^*$ for
$r \in A_1 \cap X^{-1}(D_q)$ and $H_1(r) = L_*$ otherwise. Also, define $H_2(r) = L^*$
for $r \in A_1$ and $H_2(r) = L_*$ otherwise. Then $U_x(H_1) = U_x(H_2)$ for $x \in D_q$ according
to the definition of conditional preference because $H_1(r) = H_2(r)$ for all
$r \in X^{-1}(D_q)$. But $P_x(A_1) = U_x(H_2)$ by the uniqueness of probability and
Lemma 3.134. Define $H_3(r) = qL^* + (1-q)L_*$ for all $r \in X^{-1}(D_q)$ and
$H_3(r) = L_*$, otherwise. Then the definition of conditional preference implies
that $U_x(H_3) = U_x(qL^* + (1-q)L_*) = q$. Since $U_x(H_3) = q > U_x(H_1)$ for
all $x \in D_q$, we have $H_1 \prec_x H_3$ for all $x \in D_q$. But $H_1 \sim_y H_3$ for all $y \notin D_q$,
since $H_1(r) = H_3(r)$ for all $r \notin X^{-1}(D_q)$. It follows from Axiom 6 that
$H_1 \prec H_3$. Now, note the following contradiction:

$$U(H_3) = qQ_X(D_q) < \int_{D_q} Q(A_1|x)\,dQ_X(x) = Q(\{X \in D_q\} \cap A_1) = U(H_1),$$

where the first and last equalities follow from Lemma 3.137, the inequality
follows from the definition of $D_q$, and the other equality follows from the
definition of conditional probability. □



35 This paragraph is not needed in the finite case. 

36 Since $D \in \mathcal{B}$ is obvious in the finite case, the rest of this paragraph is not
needed in the finite case.






3.3.6 State-Dependent Utility* 

We mentioned earlier that Axiom 4 may not be reasonable to assume. 
It may be the case that when the state of Nature changes, the relative 
values of various prizes also change. For example, if the states of Nature 
involve different exchange rates between two currencies, then the relative 
values of fixed amounts of the two currencies will change according to the 
state of Nature. For this reason, we prove a theorem that does not assume 
Axiom 4. If Axiom 4 fails, then Axiom 5 may not even be desirable, as the 
next example shows. 

Example 3.146. In this example, we will have the relative values of the prizes
change drastically from one state to the next. Let $R = \{r_1,r_2\}$ and $P = \{p_1,p_2\}$.
Let $U_i(p_i) = 1$ and $U_i(p_{3-i}) = 0$ for $i = 1,2$. For NM-lottery $L = \alpha p_1 + (1-\alpha)p_2$,
define $U_i(L) = \alpha U_i(p_1) + (1-\alpha)U_i(p_2)$. For horse lottery $H = (L_1,L_2)$, define
$U(H) = 0.4U_1(L_1) + 0.6U_2(L_2)$. Consider the following two horse lotteries $H_1 =
(L_{1,1},L_{1,2})$ and $H_2 = (L_{2,1},L_{2,2})$, where

$$L_{1,1} = p_2, \qquad L_{1,2} = p_2,$$

$$L_{2,1} = p_1, \qquad L_{2,2} = \tfrac12 p_1 + \tfrac12 p_2.$$

One can easily calculate $U(L_{1,1}) = U(L_{1,2}) = 0.6$, while $U(L_{2,1}) = 0.4$ and
$U(L_{2,2}) = 0.5$. So $H_2(r_i) \prec H_1(r_i)$ for $i = 1,2$, but $U(H_1) = 0.6$ and $U(H_2) =
0.7$, thus $H_1 \prec H_2$. Even though each of the NM-lotteries awarded by $H_1$ is
marginally preferred to the corresponding NM-lottery awarded by $H_2$, $U_1(H_2(r_1))$
is sufficiently higher than $U_1(H_1(r_1))$ to make up for the fact that $U_2(H_2(r_2))$ is
a little lower than $U_2(H_1(r_2))$.
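The arithmetic in Example 3.146 can be checked mechanically. The following is a minimal sketch, not part of the original text; the function names and the encoding of an NM-lottery by its weight on $p_1$ are assumptions made for illustration.

```python
from fractions import Fraction as F

# State-dependent utilities of Example 3.146: U_i(p_i) = 1 and U_i(p_{3-i}) = 0.
# An NM-lottery a*p1 + (1-a)*p2 is encoded by the weight a it puts on p1.
PRIZE_UTILITY = {1: (F(1), F(0)), 2: (F(0), F(1))}  # state -> (U_i(p1), U_i(p2))
STATE_WEIGHT = {1: F(2, 5), 2: F(3, 5)}             # the 0.4 and 0.6 in U(H)

def state_utility(state, a):
    """U_i(L) for L = a*p1 + (1-a)*p2 in the given state."""
    u1, u2 = PRIZE_UTILITY[state]
    return a * u1 + (1 - a) * u2

def horse_utility(horse):
    """U(H) = 0.4*U_1(L_1) + 0.6*U_2(L_2) for H = {state: weight on p1}."""
    return sum(STATE_WEIGHT[s] * state_utility(s, a) for s, a in horse.items())

def constant_utility(a):
    """Utility of an NM-lottery viewed as a constant horse lottery."""
    return horse_utility({1: a, 2: a})

H1 = {1: F(0), 2: F(0)}       # (p2, p2)
H2 = {1: F(1), 2: F(1, 2)}    # (p1, p1/2 + p2/2)

# Each lottery awarded by H1 is marginally preferred to the one awarded by H2 ...
marginal_h1_preferred = all(constant_utility(H2[s]) < constant_utility(H1[s]) for s in (1, 2))
# ... and yet H2 has the larger overall utility.
overall_h2_preferred = horse_utility(H1) < horse_utility(H2)
```

Exact rational arithmetic is used so that the comparisons do not depend on floating-point rounding.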

The functions $U_i$ in Example 3.146 are called a state-dependent utility.
Since we will have to abandon Axiom 5 (at least in its current form) in order
to abandon Axiom 4, and since some version of dominance is essential for
the infinite case, we will only deal with the case in which $R$ and $P$ are finite
in this section.$^{37}$

Theorem 3.147. Assume Axioms 1 and 2 and the Archimedean condi-
tion of Lemma 3.117. Assume that preference is nondegenerate. Then,
there exist a probability $Q = (q_1,\ldots,q_n)$ over $R = \{r_1,\ldots,r_n\}$ and a
state-dependent utility function $(U_1,\ldots,U_n)$ such that for every $H_1 =
(L_{1,1},\ldots,L_{1,n})$ and $H_2 = (L_{2,1},\ldots,L_{2,n})$, $H_1 \preceq H_2$ if and only if

$$\sum_{i=1}^n q_iU_i(L_{1,i}) \le \sum_{i=1}^n q_iU_i(L_{2,i}).$$



*This section may be skipped without interrupting the flow of ideas. 
37 One possible approach to dealing with the infinite case would be to assume
the existence of a conditional preference relation $\{\preceq_r : r \in R\}$ that satisfied
Axiom 6. The type of dominance that we need in the state-dependent case is
built into Axiom 6.






The $U_i$ functions are unique up to positive affine transformation (one for
each $i$). The only property of $Q$ determined by the preferences is that non-
null states must have positive probability.

The reader will note that Theorem 3.147 makes no claim of uniqueness for
$Q$. It is easy to see why not. Suppose that $(q_1,\ldots,q_n)$ is a probability over
the states and $(U_1,\ldots,U_n)$ is a state-dependent utility. Let $(t_1,\ldots,t_n)$ be
another probability such that $t_i = 0$ if and only if $q_i = 0$. For each $i$ such
that $t_i > 0$, define $V_i = q_iU_i/t_i$. If $t_i = 0$, set $V_i = U_i$. Then

$$\sum_{i=1}^n t_iV_i(L_i) = \sum_{i=1}^n q_iU_i(L_i)$$

for all $(L_1,\ldots,L_n)$. If $(q_1,\ldots,q_n)$ and $U$ are as guaranteed by Theo-
rem 3.108 and $(t_1,\ldots,t_n)$ is as above, then $V_i = q_iU/t_i$ will satisfy

$$\sum_{i=1}^n t_iV_i(L_i) = \sum_{i=1}^n q_iU(L_i).$$

This same construction can be applied whether or not Axiom 4 holds. 
What Axiom 4 achieves is the ability to identify a unique probability and 
state-independent utility. It does not preclude the existence of alternative 
state-dependent representations of preference. 
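The rescaling argument above can be illustrated numerically. This is a minimal sketch under assumed inputs: the probabilities, utilities, and horse lottery are made up for illustration, and each $U_i$ is stored as a list of utilities over a finite prize set.

```python
from fractions import Fraction as F

def lottery_utility(u, lottery):
    """Expected utility of an NM-lottery (prize probabilities) under utilities u."""
    return sum(w * x for w, x in zip(lottery, u))

def expected_utility(prob, utils, horse):
    """sum_i prob_i * U_i(L_i) for a horse lottery (L_1, ..., L_n)."""
    return sum(p * lottery_utility(u, L) for p, u, L in zip(prob, utils, horse))

def rescale(q, utils, t):
    """Return V with V_i = q_i*U_i/t_i when t_i > 0 and V_i = U_i when t_i = 0."""
    out = []
    for qi, ui, ti in zip(q, utils, t):
        if (qi == 0) != (ti == 0):
            raise ValueError("t_i must vanish exactly when q_i does")
        out.append(list(ui) if ti == 0 else [qi * x / ti for x in ui])
    return out

# Hypothetical three-state, two-prize illustration; the third state has probability 0.
q = [F(2, 5), F(3, 5), F(0)]
U = [[F(1), F(0)], [F(0), F(1)], [F(1, 2), F(1, 3)]]
t = [F(9, 10), F(1, 10), F(0)]
V = rescale(q, U, t)

H = [[F(1, 4), F(3, 4)], [F(1, 2), F(1, 2)], [F(1), F(0)]]
value_under_q = expected_utility(q, U, H)
value_under_t = expected_utility(t, V, H)
```

Both representations assign the same value to every horse lottery, so they induce the same ranking, which is why the preferences alone cannot pin down $Q$.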

The proof of Theorem 3.147 resembles those parts of the proof of The-
orem 3.108 that were relevant for the finite case. The first thing we do is
define the state-dependent utility by means of called-off comparisons. Then
we define a particular $Q$ that makes $U(H) = \sum_{i=1}^n q_iU_i(H(r_i))$. We need a
lemma first.

Lemma 3.148. Assume Axioms 1 and 2. For each state $r_j$, each pair
$(L_1,L_2)$ of NM-lotteries, and each four horse lotteries $H_1,H_2,H_3,H_4$ sat-
isfying the following conditions:

• the preference between $H_1$ and $H_2$ is called-off when $\{r_j\}^c$ occurs,

• the preference between $H_3$ and $H_4$ is called-off when $\{r_j\}^c$ occurs,
and

• $H_1(r_j) = H_3(r_j) = L_1$, $H_2(r_j) = H_4(r_j) = L_2$,

we have $H_1 \preceq H_2$ if and only if $H_3 \preceq H_4$.

Proof. First, note that $\frac12 H_1 + \frac12 H_4 = \frac12 H_2 + \frac12 H_3$. Use Lemma 3.114 to see
that $H_1 \preceq H_2$ implies $\frac12 H_1 + \frac12 H_3 \preceq \frac12 H_1 + \frac12 H_4$, which implies $H_3 \preceq H_4$
by Axiom 2. Similarly, $H_3 \preceq H_4$ implies $H_1 \preceq H_2$. □
Proof of Theorem 3.147. For each state $r_j$ and each $(n-1)$-tuple of
NM-lotteries $(L_1,\ldots,L_{j-1},L_{j+1},\ldots,L_n)$, consider the set of horse lotteries
of the form

$$(L_1,\ldots,L_{j-1},L,L_{j+1},\ldots,L_n),$$

where $L$ is an arbitrary element of $\mathcal{L}$. According to Lemma 3.148, the
ranking of these horse lotteries will be the same no matter what one chooses
for the $L_i$s. Hence, we can treat the set of these horse lotteries as the
entire set of interest and apply Lemma 3.119 to obtain a utility function
$U_j : \mathcal{L} \to [0,1]$ satisfying

$$(L_1,\ldots,L_{j-1},L'_1,L_{j+1},\ldots,L_n) \preceq (L_1,\ldots,L_{j-1},L'_2,L_{j+1},\ldots,L_n)$$

if and only if $U_j(L'_1) \le U_j(L'_2)$, no matter what one chooses for the $L_i$s.
For each $j$ such that $r_j$ is nonnull, there are prizes $p^*_j$ and $p_{*j}$ such that
$U_j(p^*_j) = 1$ and $U_j(p_{*j}) = 0$. (If $r_j$ is null, $U_j(p_i)$ can be arbitrary, since
there are no preferences among the horse lotteries.) It is easy to see that
the best and worst horse lotteries are respectively

$$H^* = (p^*_1,\ldots,p^*_n), \qquad H_* = (p_{*1},\ldots,p_{*n}).$$

Now, set up the following horse lotteries:

$$H^*_i(r_j) = \begin{cases} p^*_i & \text{if } i = j, \\ p_{*j} & \text{if } i \ne j. \end{cases}$$

Define $q_i = U(H^*_i)$, where $U$ is constructed by Lemma 3.119 based on all
of $\mathcal{H}$ with $U(H^*) = 1$ and $U(H_*) = 0$. Clearly, $q_i \ge 0$ for all $i$. To see that
$\sum_{i=1}^n q_i = 1$, note that the equal mixture of all the $H^*_i$s is

$$\frac1n H^*_1 + \cdots + \frac1n H^*_n = \frac1n H^* + \frac{n-1}{n}H_*.$$

Evaluating $U$ at both sides of this expression gives $(1/n)\sum_{i=1}^n q_i = 1/n$.

Finally, we prove that if $H = (L_1,\ldots,L_n)$, then $U(H) = \sum_{i=1}^n q_iU_i(L_i)$.
This will complete the proof. Construct $n$ horse lotteries

$$H_i(r_j) = \begin{cases} L_i & \text{if } i = j, \\ p_{*j} & \text{if } i \ne j. \end{cases}$$

By taking an equal mixture of all $n$ of these, we get

$$\frac1n H + \frac{n-1}{n}H_* = \frac1n H_1 + \cdots + \frac1n H_n.$$

Evaluating $U$ at both sides of this gives $U(H)/n = (1/n)\sum_{i=1}^n U(H_i)$. So,
we need only prove that $U(H_i) = q_iU_i(L_i)$ for each $i$. From the definition
of $U_i$, we see that $H_i \sim H^i$, where

$$H^i(r_j) = \begin{cases} U_i(L_i)p^*_i + (1-U_i(L_i))p_{*i} & \text{if } i = j, \\ p_{*j} & \text{if } i \ne j. \end{cases}$$

Since $H^i = U_i(L_i)H^*_i + (1-U_i(L_i))H_*$,

$$U(H_i) = U(H^i) = (1-U_i(L_i))U(H_*) + U_i(L_i)U(H^*_i) = q_iU_i(L_i). \qquad \Box$$




3.4 Problems 



Section 3.1: 



1. *Consider the rule $\delta$ in Example 3.13 on page 148.

(a) Find a formula for the risk function. 

(b) Find a formula for the Bayes risk with respect to Lebesgue measure. 

(c) Prove that the Bayes risk is strictly less than 1/2 for all even n. 

(d) Find the exact value of the Bayes risk if n = 2. 

2. Prove Proposition 3.16 on page 150. 

3. Two firms are planning to make competing secret bids on the price at which 
they will supply a computer system to a government agency. The firm with 
the lower bid will get the job. (No cost overruns will be allowed.) One firm 
believes that its actual cost of supplying the system is sure to be c and it 
has a prior distribution on the bid $\Theta$ of the other firm. Let $h$ be the cost
of preparing and submitting the bid. For each situation below, find the bid
this firm should make to maximize expected profit:

(a) $h = 0$ and $f_\Theta(\theta) = \exp(-\theta/\mu)/\mu$, for $\theta > 0$ and $\mu$ known.

(b) $h = 0$ and $f_\Theta$ is arbitrary.

(c) $h > 0$ and $f_\Theta(\theta) = \exp(-\theta/\mu)/\mu$, for $\theta > 0$ and $\mu$ known.

(d) $h > 0$ and $f_\Theta$ is arbitrary.

4. An actuary wants to estimate the mean number of claims for industrial 
injuries in a newly opened factory, in order to determine the premium for 
insurance. The actuary believes that, to a good approximation, the number 
of claims in any year by any one person is Poisson with mean $\theta$ conditional
on a parameter $\Theta = \theta$. Different persons are assumed to be independent
given $\Theta$. Past experience with similar factories gives a prior density for $\Theta$:

for fixed known values of $m$ and $r > 1$. After $n$ person-years have elapsed
in this factory, $s$ injuries are observed.
(a) Using $f_\Theta(\theta)$ above, show that with a loss function

the best choice of the premium d is 

r 4- s — 1 



d'(s) = 



(b) Now, assume that $n$ (the number of person-years) is fixed and treat
$S$ (the number of injuries) as random (not yet observed). Find the
risk function for $d^*$. Also find the Bayes risk for $d^*$ with respect to
the prior $f_\Theta$ and the posterior risk for $d^*(s)$ given $S = s$.



3.4. Problems 209 



5. Let $X_1,\ldots,X_n$ be IID $Ber(\theta)$ random variables conditional on $\Theta = \theta$.
Suppose that we have a loss function $L(\theta,a) = (\theta-a)^2/[\theta^2(1-\theta)^5]$, where
the action space is $\aleph = [0,1]$. The prior distribution of $\Theta$ is $Beta(\alpha,\beta)$.
Find conditions on $\alpha$ and $\beta$ such that both of the following are true:

• The formal Bayes rule exists and has finite posterior risk for all pos- 
sible samples. 

• The Bayes rule exists and has finite risk. 

6. Suppose that the conditional density of $X$ given $\Theta = \theta$ is $\exp(-|x-\theta|)/2$
and that $\Theta$ has prior density $\exp(-|\theta-\eta|)/2$ for some number $\eta$. Let $\aleph = \mathbb{R}$
and $L(\theta,a) = (\theta-a)^2$. Find the formal Bayes rule.

Section 3.2.1: 

7. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $N(\theta,1)$. Let $\delta_0(X)$ be the
median of the sample, and let $T = \delta_1(X)$ be the sample average. Find a
randomized rule based on the mean, $\delta(T)$, which has the same risk func-
tion as $\delta_0$ no matter what the loss function is. (You may wish to solve
Problem 12 on page 663 or Problem 1 on page 138 first. You probably can- 
not write a closed-form solution for the randomized rule. You may either 
describe the probability distribution in words sufficiently precise to define 
it or give an algorithm for actually performing a randomization that will 
have the appropriate distribution.) 

8. Let $\delta : \mathcal{X} \to \aleph$ be a nonrandomized rule, and let $T : \mathcal{X} \to \mathcal{T}$ be a sufficient
statistic. Let $\delta_1$ be the rule constructed in Theorem 3.18 on page 151. Show
that for each $t$, the distribution $\delta_1(t)(\cdot)$ on $\aleph$ is the probability measure
induced by $\delta$ from the conditional distribution of $X$ given $T = t$.



9.* Let $\Omega = (0,\infty) \times (0,\infty)$, $\mathcal{X} = \mathbb{R}^3$, and $\aleph = \mathbb{R}^+$. Suppose that $P_\theta$ says
$X_1,X_2,X_3$ are IID $U(\alpha,\beta)$, where $\theta = (\alpha,\beta)$. Let



Let $\delta_0(X) = \bar{X}$.

(a) Find a two-dimensional sufficient statistic, $T$.

(b) Use the Rao-Blackwell theorem 3.22 to find a rule $\delta_1(T)$ whose risk
function is at least as good as that of $\delta_0(X)$.

(c) Find the risk functions $R(\theta,\delta_0)$ and $R(\theta,\delta_1)$, and show that there is
at least one $\theta$ such that $R(\theta,\delta_1) < R(\theta,\delta_0)$.

10.* Let $\{X_n\}_{n=1}^\infty$ be conditionally IID $Ber(\theta)$ given $\Theta = \theta$, and let $X =
(X_1,\ldots,X_n)$. Let $\aleph = \Omega = (0,1)$ and $L(\theta,a) = (\theta-a)^2$. Let the prior
distribution $\lambda$ of $\Theta$ be $U(0,1)$. Let $\delta_0(x)$ be the sample median, that is

$$\delta_0(x) = \begin{cases} 0 & \text{if more than half of the observations are 0,} \\ 1 & \text{if more than half of the observations are 1,} \\ \frac12 & \text{if exactly half of the observations are 0.} \end{cases}$$






(a) Find $R(\theta,\delta_0)$ and $r(\lambda,\delta_0)$.

(b) Let $T = \sum_{i=1}^n X_i$. Find the rule guaranteed by the Rao-Blackwell
theorem 3.22.

Section 3.2.2: 

11. In Example 3.25 on page 154, find the risk function for both $\delta$ and $\delta_1$ and
show that $\delta_1$ dominates $\delta$.

12. Suppose that $P_\theta$ says that $X \sim Bin(n,\theta)$. Let $\Omega = (0,1)$ and $\aleph = [0,1]$.
Let $L(\theta,a) = (\theta-a)^2$ and



Find a nonrandomized rule that dominates $\delta$.

13. Find an example of a decision problem with a decision rule $\delta_0$ and a prob-
ability $\lambda$ on the parameter space such that $\delta_0$ is $\lambda$-admissible but $\delta_0$ is not
a Bayes rule with respect to $\lambda$.

14. Suppose that $X \sim Exp(1/\theta)$ given $\Theta = \theta$. Let the action space be $[0,\infty)$,
and let the loss function be $L(\theta,a) = (\theta-a)^2$.

(a) Prove that $\delta(x) = x$ is inadmissible.

(b) Find a nonconstant admissible rule.

15. Let $X = (X_1,\ldots,X_n)$, where the $X_i$ are conditionally IID $N(\mu,\sigma^2)$ given
$\Theta = (\mu,\sigma)$. Let $c_1 > 0$ and $c_2$ be constants. Let $\aleph = \mathbb{R}$ and $L(\theta,a) =
(\mu-a)^2$. Define $\delta(x) = (n\bar{x} + c_1c_2)/(n+c_1)$. Show that $\delta$ is admissible.

16. *Prove Proposition 3.47 on page 162. 

17. Assume that $L(\theta,a) = (\theta-a)^2$ in each of the following questions.

(a) Suppose that $X \sim N(\theta,1)$ given $\Theta = \theta$. Show that for each constant
$c$, $\delta(x) = c$ is admissible.

(b) Suppose that $X \sim U(0,\theta)$ given $\Theta = \theta$. Show that for each constant
$c$, $\delta(x) = c$ is inadmissible.

18. Let $\Omega = (0,1)$, $\aleph = [0,1]$, and $L(\theta,a) = (\theta-a)^2$. Suppose that $P_\theta$ says that
$X \sim Geo(\theta)$, that is,



is the density of $X$ with respect to counting measure given $\Theta = \theta$. Show
that $\delta(x) = x/(x+1)$ is admissible.
19. Suppose that $X \sim N(\theta,1)$ given $\Theta = \theta$, and let $\Theta$ have an $N(0,1)$ prior.
Suppose that the parameter space and the action space are both $(-\infty,\infty)$.
Let $L(\theta,a) = 0$ if $a > \theta$ and $L(\theta,a) = 1$ if $a \le \theta$.

(a) Show that there is no Bayes rule. 

(b) Show that every decision rule is inadmissible. 





for x = 0, 1, . . ., 
otherwise 






(c) Show that if the action space is $[-\infty,\infty]$, then there is a Bayes rule
and that it is the only admissible rule.

Section 3.2.3: 

20. *Prove that the modified James-Stein estimator $\delta_3(X)$ has smaller risk func-
tion than $\delta(X)$ if $n > 4$. (Hint: Let $\Gamma$ be an orthogonal transformation with
first row proportional to $\mathbf{1}^\top$, and let $Z$ be the last $n-1$ coordinates of $\Gamma X$.
What does Theorem 3.50 say about estimating $\Gamma\Theta$ by $\Gamma\delta_3(X)$?)

21. *Say that a function $g : \mathbb{R} \to \mathbb{R}$ is absolutely continuous if there exists a
function $g'$ such that for all $x_1 < x_2$, $g(x_2) = g(x_1) + \int_{x_1}^{x_2} g'(y)\,dy$.$^{38}$

(a) Prove that the conclusion to Lemma 3.51 continues to hold if the
assumption that $g$ is differentiable is replaced by the assumption that
$g$ is absolutely continuous.

(b) Prove that the conclusion to Lemma 3.52 continues to hold if the
assumption that the coordinates of $g$ are differentiable is replaced by
the assumption that $g_i$ is absolutely continuous for every $i$.

22. Let $g(x) = -x\min\{c, (n-2)/\sum_{i=1}^n x_i^2\}$ be a function from $\mathbb{R}^n$ to $\mathbb{R}^n$. Let
$\delta^*(x) = x + g(x)$.

(a) Using Problem 21 above, find all values of $c > 0$ such that $\delta^*(x)$ has
smaller risk than $\delta(x) = x$ in the setting of Theorem 3.50.

(b) Prove that for $c > (n-2)/(n+2)$, $\delta^*(x)$ has smaller risk than $\delta_1(x)$
in the setting of Theorem 3.50.

Section 3.2.4: 

23. Prove Proposition 3.58 on page 167. 

24. Let $X \sim Geo(\theta)$ given $\Theta = \theta$. Let $L(\theta,a) = (\theta-a)^2/[\theta(1-\theta)]$. Prove that
$\delta(x) = I_{\{0\}}(x)$ is minimax.

25. In Example 3.72 (see page 170), let $p_i = \Pr(\Theta = i)$ for $i = 0,1$ be a prior
distribution. Prove that it is impossible for the Bayes risk of the minimax
rule to be simultaneously strictly less than the Bayes risks of both action
3 and action 1. This example shows how the minimax principle can be in
very serious conflict with the expected loss principle.

Section 3.2.5: 

26. Prove Proposition 3.85 on page 174. 



38 Such functions are called absolutely continuous because they have a prop- 
erty similar to measures that are absolutely continuous with respect to Lebesgue 
measure. In particular, if $g$ is nondecreasing, then $\eta((a,b]) = g(b) - g(a)$ defines
a measure that is absolutely continuous with respect to Lebesgue measure. 






27. Suppose that $P_0$ says $X \sim U(0,1)$ and $P_1$ says $X \sim U(0,7)$, and that the
loss function is as in Theorem 3.87. Find all of the admissible rules under 
the conditions of that theorem. Express each rule by saying which intervals 
of X values lead to making each decision. 

28. Suppose that an observation X is to be made and it is believed that X has 
one of two densities: 



Find all of the admissible procedures according to the Neyman-Pearson 
fundamental lemma 3.87 (using the loss function stated there). Express 
the rules in terms of intervals in which each decision is taken. 

29. Prove the claim at the end of the proof of the Neyman-Pearson funda- 
mental lemma 3.87 that no element of C dominates any other element of 
C. 

30. Prove Proposition 3.91 on page 178. 
Section 3.3: 

31. Suppose that there are $k > 2$ horses in a race and that a gambler believes
that $p_i$ is the probability that horse $i$ will win ($\sum_{i=1}^k p_i = 1$). Suppose that
the gambler has decided to wager an amount $x$ to be divided among the
$k$ horses. If he or she wagers $x_i$ on horse $i$ and that horse wins, the utility
of the gambler is $\log(c_ix_i)$, where $c_1,\ldots,c_k$ are known positive numbers.
Find values $x_1,\ldots,x_k$ to maximize expected utility.

32. Suppose that two agents have a common strictly increasing utility function 
U for their fortunes in dollar amounts and that their current fortunes are 
the same, $x_0$. (So, for example, the utility of receiving an additional $x$
dollars would be $U(x_0 + x)$.)

(a) Let $R$ be a random dollar amount that is strictly greater than $-x_0$.
If one of our agents contemplates selling R, what would be the lowest 
price at which the agent would be willing to sell it? What would be 
the highest price that an agent who did not own R would be willing 
to buy it? 

(b) Suppose that one agent receives a gift consisting of a lottery ticket 
that will pay r > 0 dollars with probability 1/2 and pays nothing with 
probability 1/2 and that both agents agree on these probabilities. 
Construct a utility function U having the property that, as soon as 
an agent receives this gift, he or she is willing to sell it at some price 
less than r/2 and the other agent is willing to buy it at that same 
price. 

33. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117.
Let $R = \{r_1,\ldots,r_n\}$ and $P = \{p_1,\ldots,p_m\}$. Consider the set $\mathcal{H}'$ of all horse
lotteries of the form $(p_{i_1},\ldots,p_{i_n})$. (These are all horse lotteries whose NM-
lotteries assign probability 1 to a single prize.) Let $H_*,H^* \in \mathcal{H}'$ be such
that $H_* \preceq H \preceq H^*$ for all $H \in \mathcal{H}'$. Prove that $H_* \preceq H \preceq H^*$ for all
$H \in \mathcal{H}$.







34. Prove Proposition 3.118 on page 193. (Hint: You can use Theorem 3.147 if
you wish.)

35. Let $\gamma_1 < \gamma_2 < 1$, and suppose that $H_1$ and $H_2$ are horse lotteries such that

$$\gamma_1H_1 + (1-\gamma_1)H_2 \sim \gamma_2H_1 + (1-\gamma_2)H_2.$$

Assume Axioms 1-3 and prove that $H_1 \sim H_2$.

36. Assume all of the axioms, including Axiom 6. Show that conditional pref- 
erence given R is the same as unconditional preference. 

37. Prove the part of Lemma 3.134 that says that Q(B) = 0 if and only if B 
is null. 

38. *Assume Axioms 1-4. Let $R$ be finite. Prove that $H_1(r) \preceq H_2(r)$ for all
$r \in R$ implies $H_1 \preceq H_2$. (Hint: Create a comparison between $H_1$ and
$H'$ that is called-off when $\{r_1\}^c$ occurs. Use induction on the number of
states.)



Chapter 4 
Hypothesis Testing 



4.1 Introduction 

4.1.1 A Special Kind of Decision Problem 

Recall the setup used at the beginning of Chapter 3. We had a probability
space $(S,\mathcal{A},\mu)$ and a function $V : S \to \mathcal{V}$. One example of $V$ is the param-
eter $\Theta$. Other examples are measurable functions of $\Theta$. Other $V$ functions,
which are not functions of $\Theta$, are possible but are rarely seen in classical
statistics. This is true to a greater extent in hypothesis testing for rea-
sons that will become more apparent once we study the criteria used for
selecting tests in classical statistics.

Definition 4.1. Suppose that we can partition $\mathcal{V}$ into $\mathcal{V} = \mathcal{V}_H \cup \mathcal{V}_A$, where
$\mathcal{V}_H \cap \mathcal{V}_A = \emptyset$. The statement that $V \in \mathcal{V}_H$ is a hypothesis and is labeled
$H$. The corresponding alternative is labeled $A$ and is the statement that
$V \in \mathcal{V}_A$. If $V = \Theta$, we have $\Omega = \Omega_H \cup \Omega_A$ with $\Omega_H \cap \Omega_A = \emptyset$ and $V \in \mathcal{V}_H$
if and only if $\Theta \in \Omega_H$. In this case, we write $H : \Theta \in \Omega_H$ and $A : \Theta \in \Omega_A$.
A decision problem is called hypothesis testing if $\aleph = \{0,1\}$ and $L(v,a)$
satisfies $L(v,1) > L(v,0)$ for $v \in \mathcal{V}_H$ and $L(v,1) < L(v,0)$ for $v \in \mathcal{V}_A$.
The action $a = 1$ is called rejecting the hypothesis, and the action $a = 0$ is
called accepting the hypothesis.$^1$ If we reject $H$ but $H$ is true, we made a
type I error. If we accept $H$ and it is false, we made a type II error.



1 Some authors prefer to call action $a = 0$ not rejecting the hypothesis.



4.1. Introduction 215 



A simple type of hypothesis testing loss function is

$$L(v,a) = \begin{cases} c_a & \text{if } v \in \mathcal{V}_H, \\ b_a & \text{if } v \in \mathcal{V}_A, \end{cases} \qquad (4.2)$$

where $c_1 > c_0$ and $b_0 > b_1$. It is easy to see (see Problem 1 on page 285)
that (4.2) is equivalent to a loss function of the same form with $c_0 = b_1 = 0$,
$b_0 = 1$, and $c_1 = c > 0$. Such a loss function is called a 0-1-c loss function.
If, in addition, $c = 1$, it is called a 0-1 loss function. More general loss
functions than the 0-1-c loss might often seem appropriate for the type of
problems in which hypothesis testing is used. For example, if the parameter
is real, the hypothesis is that $\Theta \le \theta_0$, and $c > 0$, an appropriate loss might
be

$$L(\theta,a) = \begin{cases} \theta - \theta_0 & \text{if } \theta > \theta_0 \text{ and } a = 0, \\ c(\theta_0 - \theta) & \text{if } \theta \le \theta_0 \text{ and } a = 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.3)$$

This loss provides for penalties for choosing the wrong decision that are
commensurate with the inaccuracy of the decision. But this loss can be
written as $|\theta - \theta_0|$ times the 0-1-c loss. By Proposition 3.47, so long as the
risk functions of all decision rules are continuous from the left (or all are
continuous from the right) at $\theta = \theta_0$, rules admissible under the 0-1-c loss
will be admissible under this loss. One could begin the study of hypothesis
testing by concentrating solely on which decision rules are admissible. For
this purpose, the 0-1 loss is sufficient. The focus of hypothesis testing,
however, is on finding tests that meet certain ad hoc criteria to be defined
later.

A randomized decision rule $\delta$ in a hypothesis testing problem can be
described by its test function, which is the measurable function $\phi : \mathcal{X} \to
[0,1]$ given by

$$\phi(x) = \delta(x)(1) = \Pr(\text{choose } a = 1 \mid X = x).$$

One should think of a randomized test $\phi$ as follows. First, observe $X =
x$, and then flip a coin with probability of heads equal to $\phi(x)$. If the
coin comes up heads, reject the hypothesis. Because of this interpretation,
randomized tests are seldom used in practice.
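The coin-flip description of a randomized test translates directly into code. This is a minimal sketch; the test function `phi_example` is hypothetical and chosen only to show a point at which randomization actually occurs.

```python
import random

def run_randomized_test(phi, x, rng):
    """Return action 1 (reject) with probability phi(x), otherwise action 0 (accept)."""
    p = phi(x)
    if not 0.0 <= p <= 1.0:
        raise ValueError("a test function must take values in [0, 1]")
    # rng.random() is uniform on [0, 1), so this event has probability p.
    return 1 if rng.random() < p else 0

def phi_example(x):
    """Hypothetical test function: reject when x > 3, flip a biased coin when x == 3."""
    if x > 3:
        return 1.0
    if x == 3:
        return 0.25
    return 0.0

rng = random.Random(0)
actions_at_boundary = [run_randomized_test(phi_example, 3, rng) for _ in range(20000)]
rejection_rate = sum(actions_at_boundary) / len(actions_at_boundary)
```

With the same data $x = 3$, repeated runs reach different decisions, which is the feature that makes such tests unattractive in practice.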

Definition 4.4. Suppose that $V = \Theta$. The power function of a test $\phi$ is
$\beta_\phi(\theta) = E_\theta\phi(X)$. The operating characteristic curve is $\rho_\phi = 1 - \beta_\phi$. The size
of $\phi$ is $\sup_{\theta \in \Omega_H} \beta_\phi(\theta)$. A test is called level $\alpha$, for some number $0 \le \alpha \le 1$,
if its size is at most $\alpha$. A hypothesis is simple if $\Omega_H$ is a singleton. Similarly,
the alternative is simple if $\Omega_A$ is a singleton. The hypothesis (alternative)
is composite if it is not simple. For symmetry, we also define the base of
the test to be $\inf_{\theta \in \Omega_A} \beta_\phi(\theta)$. A test is said to have floor $\gamma$ if the base is at
least $\gamma$.
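To make the definitions of power function, size, and base concrete, here is a small numerical sketch. The setting is an assumption chosen for illustration: $X \sim N(\theta,1)$, $\Omega_H = (-\infty,0]$, $\Omega_A = (0,\infty)$, and the nonrandomized test that rejects when $x$ exceeds a cutoff. The supremum and infimum are approximated on finite grids.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, cutoff):
    """beta_phi(theta) = P_theta(X > cutoff) when X ~ N(theta, 1)."""
    return 1.0 - norm_cdf(cutoff - theta)

def size_and_base(cutoff, hyp_grid, alt_grid):
    """Grid approximations to sup over Omega_H and inf over Omega_A of the power."""
    size = max(power(t, cutoff) for t in hyp_grid)
    base = min(power(t, cutoff) for t in alt_grid)
    return size, base

cutoff = 1.645
hyp_grid = [-5.0 + 0.01 * i for i in range(501)]   # theta from -5 to 0
alt_grid = [0.01 * i for i in range(1, 501)]       # theta from 0.01 to 5
size, base = size_and_base(cutoff, hyp_grid, alt_grid)
```

Because the power function is increasing in $\theta$ here, the size is attained at $\theta = 0$ and the base is approached as $\theta$ decreases to 0, so the two nearly coincide: the test has level about 0.05 and floor about 0.05.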

The definitions of power function, size, level, and operating characteristic 
are all standard in classical theory, but the definitions of base and floor are 



216 Chapter 4. Hypothesis Testing 

not. Some elaboration is in order. There is a duality between hypotheses
and alternatives which is not respected in most of the classical hypothesis-
testing literature. The definitions of base and floor are introduced to com-
plete the duality among the concepts usually defined. For example, sup-
pose that we decide to switch the names of alternative and hypothesis, so
that $\Omega_H$ becomes $\Omega_A$, and vice versa. Then we can switch tests from $\phi$ to
$\psi = 1 - \phi$ and the "actions" accept and reject become switched. The power
function of $\phi$ is the operating characteristic of $\psi$, and vice versa. The size
of $\phi$ is one minus the base of $\psi$, and vice versa. The test $\phi$ has level $\alpha$ if and
only if $\psi$ has floor $1 - \alpha$. The classical optimality criteria for tests do not
respect this duality. That is, a test $\phi$ may satisfy the appropriate classical
optimality criterion for a specified hypothesis-alternative pair, but when
the names of hypothesis and alternative are switched and the same opti-
mality criterion is appropriate, $1 - \phi$ does not satisfy the same optimality
criterion. (See Problem 31 on page 289 for an example.) For this reason,
when appropriate, we will introduce new optimality criteria that are dual
to the existing ones.

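It is easy to see that the risk function for a hypothesis-testing problem
is closely related to the power function. If the loss function is 0-1-c, then
the risk function is

$$R(\theta,\phi) = \begin{cases} c\beta_\phi(\theta) & \text{if } \theta \in \Omega_H, \\ 1 - \beta_\phi(\theta) & \text{if } \theta \in \Omega_A. \end{cases} \qquad (4.5)$$

Now suppose that we let $\Omega'_H = \Omega_A$ and $\Omega'_A = \Omega_H$, so that hypothesis
and alternative are switched. Also, switch the names of the actions, set
$\psi = 1 - \phi$, and let the loss be $c$ times the 0-1-1/c loss function. Then the
risk function of $\psi$ in this new problem is

$$R'(\theta,\psi) = \begin{cases} \beta_\psi(\theta) & \text{if } \theta \in \Omega'_H, \\ c(1 - \beta_\psi(\theta)) & \text{if } \theta \in \Omega'_A, \end{cases}$$

which is easily seen to equal $R(\theta,\phi)$. So, the risk function respects the
duality between hypotheses and alternatives, as will considerations of ad-
missibility.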
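The claimed equality of the two risk functions can be checked numerically. The sketch below assumes that the 0-1-c risk is $c\beta_\phi(\theta)$ on $\Omega_H$ and $1 - \beta_\phi(\theta)$ on $\Omega_A$ (a type I error costs $c$, a type II error costs 1), and it uses an illustrative power function for a one-sided normal test; neither the cutoff nor the parameter values come from the text.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta_phi(theta, cutoff=1.645):
    """Illustrative power function: reject when X > cutoff, X ~ N(theta, 1)."""
    return 1.0 - norm_cdf(cutoff - theta)

def risk_original(theta, c, in_hypothesis):
    """Risk of phi under 0-1-c loss."""
    b = beta_phi(theta)
    return c * b if in_hypothesis else 1.0 - b

def risk_switched(theta, c, in_hypothesis):
    """Risk of psi = 1 - phi after swapping hypothesis and alternative.

    The loss is c times the 0-1-(1/c) loss, so a new type I error costs
    c * (1/c) = 1 and a new type II error costs c * 1 = c.
    """
    beta_psi = 1.0 - beta_phi(theta)
    in_new_hypothesis = not in_hypothesis   # Omega'_H is the old Omega_A
    return beta_psi if in_new_hypothesis else c * (1.0 - beta_psi)

c = 4.0
cases = [(-1.0, True), (0.0, True), (0.5, False), (2.0, False)]
max_gap = max(abs(risk_original(t, c, h) - risk_switched(t, c, h)) for t, h in cases)
```

The gap is zero up to floating-point rounding for every parameter value tried, as the algebra predicts.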

4.1.2 Pure Significance Tests 

A simpler framework for hypothesis testing dates back at least to Pearson 
(1900). In this simpler framework, one need only explicitly state the hy- 
pothesis (call it H as before), which is either a single distribution for the 
data or a class of distributions. One then creates a weak order $\preceq$ on the
sample space $\mathcal{X}$, where $x \preceq y$ is intended to mean that $y$ is more at odds
with $H$ than $x$ is.$^2$



2 See Definition 3.99 on page 183. Basically, the binary relation $\preceq$ must be
reflexive and transitive, and all pairs of data values must be compared.











Example 4.6. Let $H$ state that $X \sim N(0,1)$. We can say that $x \preceq y$ if $|x| \le |y|$.

We are quite free to define $\preceq$ however we wish, so long as it is a weak
order.

Example 4.7. Let $H$ state that $X \sim N(0,1)$. We could define $x \preceq y$ by $|x| \ge |y|$.

The most common way to define $\preceq$ is in terms of a statistic $T : \mathcal{X} \to \mathbb{R}$.
We would say that $x \preceq y$ if and only if $T(x) \le T(y)$. In Example 4.6,
$T(x) = |x|$. A pure significance test is obtained by calculating the signif-
icance probability $p_H(x)$ (Definition 4.8) and rejecting $H$ if $p_H(x)$ is too
small.

Definition 4.8. Let the hypothesis $H$ be a set of distributions on $(\mathcal{X},\mathcal{B})$.
Suppose that the quantity $p_Q(x) = Q(\{y : x \preceq y\})$ is the same (or approx-
imately the same) for all $Q$ in $H$. Then the common value $p_H(x)$ is called
the significance probability of the data $x$ relative to the weak order $\preceq$, and
the test that rejects $H$ when $p_H(x)$ is small is called a pure significance
test.

Example 4.9 (Continuation of Example 4.6). It is easy to see that 

/-\ x \ poo 
4>(y)dy+ / 4>(y)dy = 2${-\x\), 

where <j> and $ are the standard normal density and CDF, respectively. This pure 
significance test would be the same as the usual test of the hypothesis that the 
mean of a normal distribution with variance 1 is 0 versus the alternative that the 
mean is not 0. 

For the case of Example 4.7, we have

$$p_H(x) = \int_{-|x|}^{|x|} \phi(y)\,dy = 2\Phi(|x|) - 1.$$

This test would lead to rejecting $H$ if the data are too consistent with $H$. This
is what Fisher (1936) did when considering how closely the data of
Mendel (1866) matched a theory that Fisher later showed to be inaccurate.
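
The two weak orders of Examples 4.6 and 4.7 can be compared numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and the normal CDF is computed from the error function in the standard library. Under Example 4.6 the set $\{y : x \preceq y\}$ is $\{y : |y| \ge |x|\}$, and under Example 4.7 it is $\{y : |y| \le |x|\}$, so the two significance probabilities sum to 1.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_value_far_from_zero(x: float) -> float:
    """Example 4.6 order (x precedes y when |x| <= |y|): N(0,1) mass of {y : |y| >= |x|}."""
    return 2.0 * std_normal_cdf(-abs(x))


def p_value_close_to_zero(x: float) -> float:
    """Example 4.7 order (x precedes y when |x| >= |y|): N(0,1) mass of {y : |y| <= |x|}."""
    return 2.0 * std_normal_cdf(abs(x)) - 1.0


if __name__ == "__main__":
    for x in (0.05, 1.0, 1.96):
        print(x, p_value_far_from_zero(x), p_value_close_to_zero(x))
```

A value such as $x = 0.05$ has a large significance probability under the first order and a small one under the second, which is the sense in which the second test rejects $H$ when the data are "too consistent" with it.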

Example 4.10. Suppose that $H$ is the set of distributions that say that $X =
(X_1, \ldots, X_n)$ are conditionally IID with $N(0, \sigma^2)$ distribution given $\Sigma = \sigma$. Let
$T(x)$ be the usual $t$ statistic for testing the hypothesis that the mean of a normal
sample is 0, namely $T(x) = \sqrt{n}\,|\bar{x}|/s$, where $\bar{x} = \sum_{i=1}^{n} x_i/n$ and
$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n-1)$. Then $Q(\{y : T(x) \le T(y)\})$ is the same for all $\sigma$. In fact, $p_H(x) =
2T_{n-1}(-|T(x)|)$, where $T_{n-1}$ is the CDF of the $t_{n-1}(0, 1)$ distribution. The usual
$t$-test is a pure significance test.

The advantages to pure significance tests over general hypothesis tests are 
that one need not explicitly state the alternatives and one is free to choose 
the weak order $\preceq$ however one sees fit. Of course, one would normally choose
$\preceq$ with some alternative in mind, but one need not say what the alternative
is, nor need one calculate any probabilities conditional on the alternative. 




A serious disadvantage is that one never knows, until one considers explicit 
alternatives, whether one should continue calculating probabilities as if the 
hypothesis were true or not. Just because $p_H(x)$ is large does not mean that
$H$ is a better probability model for the data than some other plausible
distribution not part of $H$. Similarly, if $p_H(x)$ is quite small, it may be
the case that many other distributions not part of $H$ also give very small
probability to the set $\{y : x \preceq y\}$. Berkson (1942) forcefully argues this
point, but not forcefully enough for Fisher (1943). 

We will not discuss pure significance tests any further in this book except 
to mention a few points. 3 First, all of the hypothesis tests developed in this 
chapter can be interpreted as pure significance tests if one feels compelled 
to do so, although the hypotheses may need to be modified in order to 
satisfy the definition of pure significance test. Second, the goodness of fit 
tests described in Section 7.5.2 were originally intended to be interpreted 
as pure significance tests. Third, pure significance tests have no role to play 
in the Bayesian framework as described in various parts of this text. If the 
hypothesis H describes all of the probability distributions one is willing to 
entertain, then one cannot reject H without rejecting probability models 
altogether. If one is willing to entertain models not in H, then one needs 
to take them into account, as well as their merits relative to H, before 
deciding whether or not to reject H. 

4.2 Bayesian Solutions 
4.2.1 Testing in General 

The Bayesian solution to a hypothesis-testing problem with 0-1-c loss is
straightforward theoretically. The posterior risk from choosing action $a = 1$
is $c\Pr(V \in \mathcal{V}_H | X = x)$, and the posterior risk of choosing action $a = 0$ is
$\Pr(V \in \mathcal{V}_A | X = x)$. The optimal decision is to choose $a = 1$ if

$$c\Pr(V \in \mathcal{V}_H | X = x) < \Pr(V \in \mathcal{V}_A | X = x),$$

which is equivalent to

$$\Pr(V \in \mathcal{V}_H | X = x) < \frac{1}{1+c}. \qquad (4.11)$$

So, the Bayesian solution is to reject the hypothesis if its posterior
probability is too small, that is, smaller than $1/(1+c)$. Theoretically, that is all there
is to Bayesian hypothesis testing with 0-1-c loss. In practical problems, it
may be computationally difficult to calculate the posterior probability that
$V \in \mathcal{V}_H$, but this is a numerical analysis problem.
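
As a small illustration of the rule in (4.11), the sketch below compares the two posterior risks directly. The function name and argument names are hypothetical; the only input assumed is the posterior probability of the hypothesis, however it was obtained.

```python
def bayes_test_01c(post_prob_h: float, c: float) -> int:
    """Return 1 (reject H) or 0 (accept H) under 0-1-c loss.

    post_prob_h is the posterior probability of the hypothesis and c > 0 is
    the cost of a false rejection relative to a false acceptance.
    """
    if not 0.0 <= post_prob_h <= 1.0:
        raise ValueError("post_prob_h must be a probability")
    if c <= 0.0:
        raise ValueError("c must be positive")
    risk_reject = c * post_prob_h        # posterior risk of action a = 1
    risk_accept = 1.0 - post_prob_h      # posterior risk of action a = 0
    return 1 if risk_reject < risk_accept else 0


if __name__ == "__main__":
    # With c = 19 the threshold 1/(1+c) is 0.05.
    print(bayes_test_01c(0.04, 19.0), bayes_test_01c(0.06, 19.0))
```

Comparing the risks rather than the threshold $1/(1+c)$ keeps the code tied to the decision-theoretic derivation; the two formulations agree except in how ties are broken.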



3 Cox and Hinkley (1974, Chapters 3-5) discuss pure significance tests and 
related topics in great detail. A nice review is contained in Cox (1977). 






Example 4.12. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^{\infty}$ are IID $N(\mu, \sigma^2)$, where
$\theta = (\mu, \sigma)$ and $X = (X_1, \ldots, X_n)$. Let $V = \Theta$ and $\Omega_H = \{(\mu, \sigma) : \mu \ge \mu_0\}$,
and let $L$ be a 0-1-c loss function. If we use the measure with Radon-Nikodym
derivative $1/\sigma$ with respect to Lebesgue measure as an improper prior, then the
posterior distribution of $M$ is $t_{n-1}(\bar{x}, s/\sqrt{n})$. The formal Bayes rule is

$$\phi(x) = \begin{cases} 1 & \text{if } t < T_{n-1}^{-1}\left(\frac{1}{1+c}\right), \\ 0 & \text{if } t > T_{n-1}^{-1}\left(\frac{1}{1+c}\right), \\ \text{arbitrary} & \text{if } t = T_{n-1}^{-1}\left(\frac{1}{1+c}\right), \end{cases}$$

where $t = \sqrt{n}(\bar{x} - \mu_0)/s$ is the usual $t$ statistic and $T_{n-1}$ is the CDF of the
$t_{n-1}(0, 1)$ distribution. Note that this is the usual size $1/(1 + c)$ $t$-test of $H$ from
every elementary statistics course.

The Bayesian solution, as stated above, applies to predictive hypothesis 
testing as well as to parametric testing. For example, note that (4.11) is for- 
mulated in terms of predictive probabilities. Classical theory is not as well 
equipped as Bayesian to deal with predictive hypothesis tests. 4 The closest 
the classical theory comes to dealing with predictive testing is as a predic- 
tive decision problem. The type of hypothesis constructed in Example 4.13 
is closely related to tolerance sets as described in Section 5.2.3. 

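Example 4.13. In the classical setting (see (3.15) on page 149) the predictive loss
function is first converted to a parametric loss function and then the parametric
decision problem is solved. For the hypothesis-testing case with 0-1-c loss,

$$\begin{aligned}
L(v, \delta(x)) &= c\,I_{\mathcal{V}_H}(v)\,\phi(x) + I_{\mathcal{V}_A}(v)[1 - \phi(x)], \\
L(\theta, \delta(x)) &= c\,P_{\theta,V}(\mathcal{V}_H)\,\phi(x) + \{1 - P_{\theta,V}(\mathcal{V}_H)\}[1 - \phi(x)] \\
&= \phi(x)[(c+1)P_{\theta,V}(\mathcal{V}_H) - 1] + 1 - P_{\theta,V}(\mathcal{V}_H), \\
R(\theta, \phi) &= \beta_\phi(\theta)[(c+1)P_{\theta,V}(\mathcal{V}_H) - 1] + 1 - P_{\theta,V}(\mathcal{V}_H). \qquad (4.14)
\end{aligned}$$

Now, define

$$\begin{aligned}
\Omega_H &= \left\{\theta : P_{\theta,V}(\mathcal{V}_H) \ge \frac{1}{1+c}\right\}, \\
\Omega_A &= \left\{\theta : P_{\theta,V}(\mathcal{V}_H) < \frac{1}{1+c}\right\}, \\
e(\theta) &= \begin{cases} 1 - P_{\theta,V}(\mathcal{V}_H) & \text{if } \theta \in \Omega_H, \\ c\,P_{\theta,V}(\mathcal{V}_H) & \text{if } \theta \in \Omega_A, \end{cases} \\
d(\theta) &= |(c+1)P_{\theta,V}(\mathcal{V}_H) - 1|.
\end{aligned}$$

Now note that $R(\theta, \phi)$ in (4.14) is exactly equal to $e(\theta)$ plus $d(\theta)$ times the
risk function from a 0-1 loss as given in (4.5) for the hypothesis $H : \Theta \in \Omega_H$.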



4 The interested reader should try to extend the classical definitions of level, 
power, and so forth to the case of predictive hypothesis testing and see what 
happens. The problem arises because one usually assumes that the future data 
are independent of the past data conditional on the parameters, and all classical 
inferences are conditional on the parameters. Hence, the past tells us nothing 
about the future, and vice versa. 






If power functions are continuous at that $\theta$ such that $P_{\theta,V}(\mathcal{V}_H) = 1/(c + 1)$,
then the predictive testing problem has been converted into a parametric testing
problem. In words, we have replaced a test concerning the observable $V$ with a
test concerning the conditional distribution of $V$ given $\Theta$.

Another area in which Bayesian and classical hypothesis testing differ 
dramatically is their treatment of more general loss functions. When the 
focus of classical testing is on admissible tests, then it does not matter 
which of several equivalent loss functions one uses. A Bayesian solution to 
a testing problem will depend on which loss one uses because one is trying 
to minimize the posterior risk. For example, with the loss function in (4.3),
the posterior risk for choosing $a = 0$ is $\int I_{(\theta_0,\infty)}(\theta)(\theta - \theta_0)\,dF_{\Theta|X}(\theta|x)$, while
the posterior risk for choosing $a = 1$ is $c\int I_{(-\infty,\theta_0]}(\theta)(\theta_0 - \theta)\,dF_{\Theta|X}(\theta|x)$.
A little algebra shows that the formal Bayes rule is to choose $a = 1$ if

$$\mathrm{E}(\Theta|X = x) - \theta_0 > (c - 1)\Pr(\Theta \le \theta_0|X = x)\,\{\theta_0 - \mathrm{E}(\Theta|\Theta \le \theta_0, X = x)\}.$$

It may turn out that this is the same decision rule as is optimal with a 
0-1-c' loss for some number c'. 
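
The "little algebra" can be checked numerically. The sketch below assumes, as the two posterior risks above indicate, that the loss charges $\theta - \theta_0$ for accepting $H$ when $\theta > \theta_0$ and $c(\theta_0 - \theta)$ for rejecting $H$ when $\theta \le \theta_0$. The posterior is taken to be a hypothetical discrete distribution so that the integrals become sums; all names are illustrative.

```python
def posterior_risks(thetas, probs, theta0, c):
    """Posterior risks (accept, reject) for a discrete posterior on thetas."""
    risk_accept = sum(p * (t - theta0) for t, p in zip(thetas, probs) if t > theta0)
    risk_reject = c * sum(p * (theta0 - t) for t, p in zip(thetas, probs) if t <= theta0)
    return risk_accept, risk_reject


def reject_by_inequality(thetas, probs, theta0, c):
    """Evaluate E(T) - theta0 > (c - 1) Pr(T <= theta0) {theta0 - E(T | T <= theta0)}."""
    mean = sum(p * t for t, p in zip(thetas, probs))
    p_le = sum(p for t, p in zip(thetas, probs) if t <= theta0)
    # Pr(T <= theta0) * {theta0 - E(T | T <= theta0)} written as a single sum,
    # which avoids dividing by zero when Pr(T <= theta0) = 0.
    shortfall = sum(p * (theta0 - t) for t, p in zip(thetas, probs) if t <= theta0)
    assert p_le >= 0.0
    return mean - theta0 > (c - 1.0) * shortfall


if __name__ == "__main__":
    thetas, probs = [-1.0, 0.0, 1.0, 2.0], [0.1, 0.2, 0.4, 0.3]
    for c in (3.0, 20.0):
        accept, reject = posterior_risks(thetas, probs, 0.0, c)
        print(c, reject < accept, reject_by_inequality(thetas, probs, 0.0, c))
```

For this posterior the two formulations agree: with $c = 3$ both say to reject, and with $c = 20$ both say to accept.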

Example 4.15. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and that the hypothesis
is $H : \Theta \le \theta_0$ with loss (4.3). It is easy to see that the formal Bayes rule with
respect to a prior $\mu_\Theta$ is to choose action $a = 1$ if

$$\int_{(\theta_0,\infty)} (\theta - \theta_0)\exp\left(x[\theta - \theta_0] - \frac{\theta^2}{2}\right) d\mu_\Theta(\theta)
> c\int_{(-\infty,\theta_0]} (\theta_0 - \theta)\exp\left(x[\theta - \theta_0] - \frac{\theta^2}{2}\right) d\mu_\Theta(\theta).$$

The expression on the left is increasing in $x$ and the expression on the right is
decreasing in $x$, so the formal Bayes rule is to choose action $a = 1$ if $x > k$
for some number $k$. This rule has the same form as the formal Bayes rules with
respect to 0-1-c' loss functions.

In classical hypothesis testing, it is not common to recommend different 
tests depending on whether the loss is 0-1-c or of the form of (4.3) or 
anything else. In fact, very little attention is paid to what the loss function 
might be in classical testing. Were the focus solely on finding all admissible 
rules, this might not be a problem. However, once we advance beyond 
the simplest types of testing situations, the classical theory will tend to 
abandon the goal of finding all admissible rules and concentrate instead on 
finding all tests that satisfy certain ad hoc criteria. 

4.2.2 Bayes Factors 

The most striking difference between classical and Bayesian hypothesis
testing arises in the treatment of point hypotheses of the form $H : \Theta = \theta_0$
versus $A : \Theta \ne \theta_0$. When the parameter space is uncountable, prior
distributions are typically continuous. This means that the prior (and
posterior) probability of $\Theta = \theta_0$ is 0. In order to take seriously the
problem of testing a point hypothesis, one must use a prior distribution in
which $\Pr(\Theta = \theta_0) > 0$. Alternatively, one can replace the hypothesis
with (what might be more reasonable) an interval hypothesis of the form
$H' : \Theta \in [\theta_0 - \epsilon, \theta_0 + \epsilon]$. This latter case is no different from anything
considered already. The case of a point hypothesis has some interesting
features, which we will explore in the remainder of this section.

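Jeffreys (1961) suggests the use of what are now called Bayes factors for
comparing a point hypothesis to a continuous alternative. Let $P_\theta \ll \nu$ for
all $\theta$, and suppose that one assigns probability $p_0$ to $\{\Theta = \theta_0\}$ and uses
a prior distribution $\lambda$ on $\Omega \setminus \{\theta_0\}$ for the conditional prior given $\Theta \ne \theta_0$.
Then the joint density of the data and $\Theta$ (with respect to $\nu$ times the sum
of $\lambda$ and a point mass at $\theta_0$) is

$$f_{X,\Theta}(x, \theta) = \begin{cases} p_0 f_{X|\Theta}(x|\theta_0) & \text{if } \theta = \theta_0, \\ (1 - p_0) f_{X|\Theta}(x|\theta) & \text{if } \theta \ne \theta_0. \end{cases}$$

The marginal density of the data is

$$f_X(x) = p_0 f_{X|\Theta}(x|\theta_0) + (1 - p_0)\int f_{X|\Theta}(x|\theta)\,d\lambda(\theta).$$

The posterior distribution of $\Theta$ has density (with respect to the sum of $\lambda$
and a point mass at $\theta_0$)

$$f_{\Theta|X}(\theta|x) = \begin{cases} p_1 & \text{if } \theta = \theta_0, \\ (1 - p_1)\dfrac{f_{X|\Theta}(x|\theta)}{\int f_{X|\Theta}(x|t)\,d\lambda(t)} & \text{if } \theta \ne \theta_0, \end{cases}$$

where

$$p_1 = \frac{p_0 f_{X|\Theta}(x|\theta_0)}{f_X(x)}$$

is the posterior probability of $\Theta = \theta_0$. It is easy to see that

$$\frac{p_1}{1 - p_1} = \frac{p_0}{1 - p_0}\,\frac{f_{X|\Theta}(x|\theta_0)}{\int f_{X|\Theta}(x|\theta)\,d\lambda(\theta)}. \qquad (4.16)$$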

The second factor on the right-hand side of (4.16) is the Bayes factor. It
would be the posterior odds in favor of $\Theta = \theta_0$ if $p_0 = 0.5$. For other values of
$p_0$, one needs to multiply the prior odds times the Bayes factor to calculate
the posterior odds. The advantage of calculating a Bayes factor over the
posterior odds ($p_1/[1 - p_1]$) is that one need not state a prior odds in favor
of the hypothesis. This might be useful if one is reporting the results of an 
experiment rather than trying to make a decision. One must still, however, 
state a prior distribution over the alternative given that the hypothesis is 
false. 
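
The bookkeeping in (4.16) is easy to mechanize. The following sketch (the function name is illustrative) converts a prior probability $p_0$ and a Bayes factor into the posterior probability $p_1$ of the point hypothesis by multiplying the prior odds by the Bayes factor.

```python
def posterior_prob_point_null(p0: float, bayes_factor: float) -> float:
    """Posterior probability of the point hypothesis, given its prior
    probability p0 (strictly between 0 and 1) and the Bayes factor."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    prior_odds = p0 / (1.0 - p0)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # With even prior odds, the posterior odds equal the Bayes factor.
    print(posterior_prob_point_null(0.5, 0.1465))
```

Reporting the Bayes factor alone lets each reader supply a personal $p_0$ to a function like this one.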






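Example 4.17. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Omega_H = \{\theta_0\}$, $\Omega_A =
(-\infty, \theta_0) \cup (\theta_0, \infty)$. Let the prior probability of the hypothesis be $\Pr(\Theta = \theta_0) =
p_0 > 0$. Suppose that the conditional prior distribution of $\Theta$ given $\Theta \ne \theta_0$ is a
measure $\lambda$. It is not difficult to show that

$$\frac{p_1}{1 - p_1} = \frac{p_0}{1 - p_0}\,\frac{\exp\left(-\frac{\theta_0^2}{2}\right)}{\int \exp\left(x[\theta - \theta_0] - \frac{\theta^2}{2}\right) d\lambda(\theta)}.$$

If $\lambda$ puts positive mass on both sides of $\theta_0$, then it is easily verified that
$\int \exp(x[\theta - \theta_0] - \theta^2/2)\,d\lambda(\theta)$ is convex as a function of $x$ and goes to $\infty$ as $x \to \pm\infty$. So all
formal Bayes rules will be of the form "reject $H$ if $x$ is outside of some bounded
interval."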

When testing hypotheses of the form $H : \Theta = \theta_0$, the formal Bayes
rule can be written in the form "reject H if the Bayes factor is less than 
something." It is possible to bound the Bayes factor from below when the 
likelihood function is bounded above. That is, we might be able to find a 
distribution A that would lead to the smallest possible Bayes factor. 5 This 
lower bound would give a bound on how strongly the data conflict with the 
hypothesis. 

Example 4.18 (Continuation of Example 4.17). The Bayes factor in this
example is

$$\frac{\exp\left(-\frac{\theta_0^2}{2}\right)}{\int \exp\left(x[\theta - \theta_0] - \frac{\theta^2}{2}\right) d\lambda(\theta)},$$

which is minimized (over $\lambda$) by the distribution that puts probability 1 on
that value of $\theta$ which maximizes the integrand, which is the likelihood function.
In this case, that would be $\theta = x$, and the lower bound on the Bayes factor is
$\exp(-[x - \theta_0]^2/2)$. For example, if $x = \theta_0 + 1.96$, which is the critical value for the
usual two-sided level 0.05 test of $H$, we get a lower bound of 0.1465. This says
that a data value that would just barely lead to rejecting $H$ at level 0.05 could
not possibly change one's odds against the hypothesis by more than a factor of 7,
and then only in the extremely unlikely case that one believed before seeing the
data that $\Theta$ was sure to equal $\theta_0 + 1.96$ if it was not $\theta_0$. Put another way, in order
for the posterior probability of $H$ to be as low as 0.05, the prior probability $p_0$
would have to be lower than 0.2643, and much lower if a more reasonable prior
on the alternative were used.

A more realistic expression of prior opinion might be that the prior, given
$\Theta \ne \theta_0$, is a normal distribution with mean $\theta_0$ and some variance $\tau^2$. In this case,
the Bayes factor is

$$\sqrt{1 + \tau^2}\,\exp\left(-\frac{(x - \theta_0)^2\tau^2}{2(1 + \tau^2)}\right). \qquad (4.19)$$

The prior in this class that leads to the smallest Bayes factor can easily be shown
(see Problem 7 on page 286) to be the one with

$$\tau^2 = \begin{cases} (x - \theta_0)^2 - 1 & \text{if } |x - \theta_0| > 1, \\ 0 & \text{otherwise.} \end{cases}$$



5 For more discussion of this technique, see Edwards, Lindman, and Savage 
(1963). 






The minimum Bayes factor is $|x - \theta_0|\exp(\{-[x - \theta_0]^2 + 1\}/2)$ if $|x - \theta_0| > 1$.
The minimum is 1 if $|x - \theta_0| \le 1$. At $x = \theta_0 + 1.96$, the minimum Bayes factor
is 0.4734. This time, the prior probability $p_0$ would have to be lower than 0.1 in
order for the posterior probability to be as low as 0.05.

Intermediate to the two bounds above is the bound obtained by supposing
that $\Theta$ has distribution symmetric around $\theta_0$, but not necessarily normal. Since a
prior that is symmetric around $\theta_0$ is a mixture of priors that put probability 1/2
on two points symmetrically located around $\theta_0$, the smallest value of the Bayes
factor among all symmetric priors can be obtained by maximizing over priors
that put probability 1/2 on $\theta_0 \pm c$ for $c \ge 0$. For such a prior, the density of the
data given that the hypothesis is false equals

$$(2\sqrt{2\pi})^{-1}\left[\exp\left(-\frac{(x - \theta_0 - c)^2}{2}\right) + \exp\left(-\frac{(x - \theta_0 + c)^2}{2}\right)\right].$$

Maximizing this as a function of $c$ leads to $c = 0$ if $|x - \theta_0| \le 1$. If $|x - \theta_0| > 1$,
the maximum occurs at the solution to the equation

$$\frac{x - \theta_0 - c}{x - \theta_0 + c} = \exp(-2c[x - \theta_0]).$$

For $|x - \theta_0| > 1.5$, the solution $c$ is very nearly equal to $|x - \theta_0|$ (although
it is always strictly smaller than $|x - \theta_0|$). If $x = \theta_0 + 1.96$, for example, then
$c = 1.958$. The value of $f_X(x)$, when $c = |x - \theta_0|$, is $[1 + \exp(-2|x - \theta_0|^2)]/[2\sqrt{2\pi}]$.
If $x = \theta_0 + 1.96$, for example, then the lower bound on the Bayes factor is 0.2928,
approximately twice the global lower bound. This is not surprising, since the
two-point distribution puts half of its probability very nearly at the same point
as does the one-point distribution that led to the global lower bound. The other
half of the probability is on a point that contributes nearly nothing because it is
so far from $x$.
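
The three lower bounds quoted in this example can be reproduced with a few lines of code. This is a sketch rather than anything from the original text: the function names are illustrative, $z$ stands for $x - \theta_0$, and the symmetric-prior bound is found by a simple grid search over $c \in [0, |z|]$ instead of solving the stationarity equation.

```python
import math


def global_lower_bound(z: float) -> float:
    """Bayes factor when the alternative prior is a point mass at theta = x."""
    return math.exp(-z * z / 2.0)


def normal_prior_lower_bound(z: float) -> float:
    """Smallest Bayes factor over N(theta0, tau^2) alternative priors."""
    a = abs(z)
    return 1.0 if a <= 1.0 else a * math.exp((1.0 - a * a) / 2.0)


def symmetric_prior_lower_bound(z: float, steps: int = 20000) -> float:
    """Smallest Bayes factor over two-point priors at theta0 +/- c (grid search)."""
    a = abs(z)
    if a <= 1.0:
        return 1.0  # c = 0 is optimal, so both marginal densities coincide
    best = 0.0
    for i in range(steps + 1):
        c = a * i / steps
        # Alternative density up to the common factor 1/sqrt(2*pi).
        mix = 0.5 * (math.exp(-((a - c) ** 2) / 2.0) + math.exp(-((a + c) ** 2) / 2.0))
        best = max(best, mix)
    return math.exp(-a * a / 2.0) / best


def max_prior_prob(bayes_factor: float, posterior_prob: float = 0.05) -> float:
    """Prior probability p0 that yields the given posterior probability."""
    prior_odds = posterior_prob / (1.0 - posterior_prob) / bayes_factor
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    z = 1.96
    for bound in (global_lower_bound, symmetric_prior_lower_bound, normal_prior_lower_bound):
        bf = bound(z)
        print(bound.__name__, round(bf, 4), round(max_prior_prob(bf), 4))
```

The grid search is adequate here because the mixture density is very flat near its maximum; a root finder applied to the stationarity equation would be the more precise alternative.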

The global lower bound on the Bayes factor, namely

$$\frac{f_{X|\Theta}(x|\theta_0)}{\sup_{\theta \ne \theta_0} f_{X|\Theta}(x|\theta)}, \qquad (4.20)$$

is closely related to the likelihood ratio test statistic, which is discussed in
Section 4.5.5.

Upper bounds on Bayes factors are usually harder to come by. This is
due to the fact that there are often priors (even conjugate priors) that
place such high probability on the data being very far from what was
observed that the hypothesis will be highly favored if such a prior is used
for the alternative. For example, in Example 4.17, if the alternative prior
is $N(\theta_0, \tau^2)$, the Bayes factor goes to $\infty$ as $\tau^2$ goes to $\infty$. In this regard,
it is important to note that improper priors are particularly inappropriate
for the conditional distribution of $\Theta$ given $\Theta \ne \theta_0$. The limit as $\tau^2$ goes
to $\infty$ in Example 4.17 leads to an improper prior. As we just noted, the
Bayes factor goes to $\infty$ because the improper prior for the alternative says
that $\Theta$ has probability 1 of being outside of every bounded interval. Since
the data will surely be inside some bounded interval, it will appear to be
much more consistent with the hypothesis than the alternative. There are
ways, however, to use limits of proper priors in Bayes factors.






Example 4.21 (Continuation of Example 4.18; see page 222). Suppose that we
wish to let $\tau^2$ go to $\infty$ in the $N(\theta_0, \tau^2)$ prior for $\Theta$ given the alternative. In order
to use an improper prior to approximate a proper prior in this problem, we would
have to let the prior on the hypothesis be improper also. This could be done by
letting $p_0$ go to zero in such a way that $p_0\tau \to k$. In this case, $p_0/[1 - p_0]$ times
the Bayes factor converges to $k\exp(-[x - \theta_0]^2/2)$. In this way, $k$ acts like the
prior odds ratio, and $\exp(-[x - \theta_0]^2/2)$ acts like the Bayes factor. In fact, $k$ is
the limit (as $\tau \to \infty$) of the ratio of $p_0$ to the prior probability that $\Theta$ is in the
interval $[-\sqrt{\pi/2}, \sqrt{\pi/2}]$ given the alternative. (See Problem 8 on page 287.)



By restricting the class of prior distributions, one can obtain useful upper
bounds on Bayes factors. For example, in Example 4.17, one could restrict
attention to priors with $\tau^2 \le c$. Since the Bayes factor is increasing as a
function of $\tau^2$, we get that the maximum occurs at $\tau^2 = c$. For large $c$,
one can easily compute the upper bound to be approximately $\sqrt{c}$ times the
global minimum Bayes factor.
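
A short numerical check of this approximation, assuming the normal-prior Bayes factor $\sqrt{1+\tau^2}\exp(-z^2\tau^2/[2(1+\tau^2)])$ with $z = x - \theta_0$ (the ratio of the $N(\theta_0, 1)$ density to the $N(\theta_0, 1+\tau^2)$ marginal density at $x$). The function name is illustrative.

```python
import math


def bayes_factor_normal_prior(z: float, tau2: float) -> float:
    """Bayes factor for H: Theta = theta0 against a N(theta0, tau2) alternative
    prior when X ~ N(theta, 1) and z = x - theta0."""
    return math.sqrt(1.0 + tau2) * math.exp(-z * z * tau2 / (2.0 * (1.0 + tau2)))


if __name__ == "__main__":
    z = 1.96
    global_min = math.exp(-z * z / 2.0)
    for c in (10.0, 100.0, 1e4):
        exact = bayes_factor_normal_prior(z, c)
        approx = math.sqrt(c) * global_min
        print(c, exact, approx, exact / approx)
```

The printed ratio approaches 1 as the cap $c$ on $\tau^2$ grows, while the bound itself grows without limit, which is the reason an unrestricted upper bound is uninformative.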

Bayes factors can also be calculated in cases in which the hypothesis is of
the form $H : g(\Theta) = g(\theta_0)$ versus $A : g(\Theta) \ne g(\theta_0)$ for some function $g$. For
example, the hypothesis might concern only one of several coordinates of
$\Theta$. In this case, global lower bounds on the Bayes factor are not particularly
useful. 

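Example 4.22. Let $\Theta = (M, \Sigma)$, and suppose that $X_1, \ldots, X_n$ are conditionally
IID given $\Theta = (\mu, \sigma)$ with $N(\mu, \sigma^2)$ distribution. Suppose that $H : M = \mu_0$
is the hypothesis. Given $M \ne \mu_0$, we suppose that $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ and
that $M$ given $\Sigma = \sigma$ has $N(\mu^0, \sigma^2/\lambda_0)$ distribution. This is the usual conjugate
prior distribution. Conditional on $M = \mu_0$, we still need a prior distribution of
$\Sigma^2$. We will use the conditional distribution given $M = \mu_0$ obtained from the
joint distribution given $M \ne \mu_0$. Conditional on $M = \mu_0$, $\Sigma^2$ has $\Gamma^{-1}(a_0^*/2, b_0^*/2)$
distribution, where $a_0^* = a_0 + 1$ and $b_0^* = b_0 + \lambda_0(\mu_0 - \mu^0)^2$. The conditional
density of $(X, \Sigma)$ given $M = \mu_0$, $f_{X,\Sigma|M}(x, \sigma|\mu_0)$, equals

$$\frac{2\left(\frac{b_0^*}{2}\right)^{\frac{a_0^*}{2}}}{(2\pi)^{\frac{n}{2}}\,\Gamma\left(\frac{a_0^*}{2}\right)}\,\sigma^{-(n + a_0^* + 1)}\exp\left\{-\frac{1}{2\sigma^2}\left[b_0^* + w + n(\bar{x}_n - \mu_0)^2\right]\right\}.$$

Given $M \ne \mu_0$, the joint density of $(X, M, \Sigma)$ is

$$\frac{2\left(\frac{b_0}{2}\right)^{\frac{a_0}{2}}\sqrt{\lambda_0}}{(2\pi)^{\frac{n+1}{2}}\,\Gamma\left(\frac{a_0}{2}\right)}\,\sigma^{-(n + a_0 + 2)}\exp\left\{-\frac{1}{2\sigma^2}\left[b_0 + w + \lambda_1(\mu - \mu_1)^2 + \frac{n\lambda_0}{\lambda_1}(\bar{x}_n - \mu^0)^2\right]\right\},$$

where

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad w = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2, \qquad \lambda_1 = \lambda_0 + n, \qquad \mu_1 = \frac{n\bar{x}_n + \lambda_0\mu^0}{\lambda_1}.$$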






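If we integrate the parameters out of the two densities above and take the ratio,
we get the Bayes factor:

$$\sqrt{\frac{\lambda_1}{\lambda_0}}\;\frac{(b_0^*)^{\frac{a_0^*}{2}}\,(b_1)^{\frac{a_1}{2}}\,\Gamma\left(\frac{a_1^*}{2}\right)\Gamma\left(\frac{a_0}{2}\right)}{(b_0)^{\frac{a_0}{2}}\,(b_1^*)^{\frac{a_1^*}{2}}\,\Gamma\left(\frac{a_1}{2}\right)\Gamma\left(\frac{a_0^*}{2}\right)}, \qquad (4.23)$$

where

$$a_1 = a_0 + n, \qquad b_1 = b_0 + w + \frac{n\lambda_0}{\lambda_1}(\bar{x}_n - \mu^0)^2,$$

$$a_1^* = a_0^* + n, \qquad b_1^* = b_0^* + w + n(\bar{x}_n - \mu_0)^2.$$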
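
The conjugate-prior Bayes factor is convenient to evaluate on the log scale. The sketch below is an illustration only: it assumes the form of (4.23) obtained by integrating $\sigma$ out of the density under the hypothesis and $(\mu, \sigma)$ out of the density under the alternative, and the function and argument names (`prior_mean` for $\mu^0$, `lam0` for $\lambda_0$) are hypothetical.

```python
import math


def log_bayes_factor_conjugate(n, xbar, w, mu0, prior_mean, a0, b0, lam0):
    """Log of the Bayes factor for H: M = mu0 under the conjugate prior.

    n, xbar, w are the sample size, sample mean, and sum of squared
    deviations; a0, b0, lam0, prior_mean are the hyperparameters.
    """
    lam1 = lam0 + n
    a0_star = a0 + 1.0
    b0_star = b0 + lam0 * (mu0 - prior_mean) ** 2
    a1 = a0 + n
    b1 = b0 + w + (n * lam0 / lam1) * (xbar - prior_mean) ** 2
    a1_star = a0_star + n
    b1_star = b0_star + w + n * (xbar - mu0) ** 2
    return (
        0.5 * math.log(lam1 / lam0)
        + 0.5 * a0_star * math.log(b0_star)
        + 0.5 * a1 * math.log(b1)
        - 0.5 * a0 * math.log(b0)
        - 0.5 * a1_star * math.log(b1_star)
        + math.lgamma(a1_star / 2.0)
        + math.lgamma(a0 / 2.0)
        - math.lgamma(a1 / 2.0)
        - math.lgamma(a0_star / 2.0)
    )


if __name__ == "__main__":
    # Hypothetical data and hyperparameters chosen only for illustration.
    log_bf = log_bayes_factor_conjugate(14, 2.7, 41.0, 1.5, 1.5, 1.0, 1.0, 1.0)
    print(math.exp(log_bf))
```

Working with `math.lgamma` and logarithms avoids overflow in the gamma functions and the powers of $b_1$ and $b_1^*$ when $n$ is large.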

To put a lower bound on the Bayes factor, we first note that the conditional
distribution of $M$ given $\Sigma$ and $M \ne \mu_0$ which will lead to the largest marginal
density for $X$ given $M \ne \mu_0$ is the one that says $M = \bar{x}$ with probability 1. We
are then left with the problem of finding distributions for $\Sigma$ given $M = \mu_0$ and
given $M \ne \mu_0$. It is easy to see that if we let the distribution of $\Sigma$ be concentrated
at the same value $c$ for both the hypothesis and the alternative, then the Bayes
factor is $\exp(-n(\bar{x} - \mu_0)^2/[2c^2])$, which goes to 0 as $c$ goes to 0, unless $\bar{x} = \mu_0$.
If $\bar{x} = \mu_0$ (a probability 0 event given $\Theta$), the lower bound on the Bayes factor
is still 0, but one achieves this by letting the priors for $\Sigma$ be different under the
hypothesis and alternative.

If one wished to use improper priors, one would have to let $\lambda_0$ go to 0 while
$p_0/\sqrt{\lambda_0}$ converges to some finite strictly positive number $k$. 6 In this case $a_0^* = a_0$
instead of $a_0 + 1$ because $\Sigma$ and $M$ are independent in the improper prior. To
convert (4.23) to the case of the improper prior, we set $a_0 = -1$ and $b_0 = 0$. The
product of the prior odds and the Bayes factor becomes

$$k\sqrt{n}\left(1 + \frac{t^2}{n-1}\right)^{-\frac{n-1}{2}},$$

where $t = \sqrt{n}(\bar{x} - \mu_0)/\sqrt{w/[n - 1]}$ is the usual $t$ statistic used to test $H : M = \mu_0$.

In general, minimizing a Bayes factor for a problem like the one in
Example 4.22 would require choosing the prior for the alternative to maximize
the predictive density and choosing the prior for the hypothesis to minimize
the predictive density. But this latter problem was already seen to lead to
the minimum being 0 in most cases. In short, the global lower bound on
the Bayes factor, when the hypothesis concerns only a function of the
parameter, will most likely be 0 and so is not useful. An alternative to the
global lower bound is an approximate Bayes factor formed by maximizing
the marginal density of $X$ separately under the hypothesis and alternative:

$$\frac{\sup_{\theta \in \Omega_H} f_{X|\Theta}(x|\theta)}{\sup_{\theta \in \Omega_A} f_{X|\Theta}(x|\theta)}. \qquad (4.24)$$



6 This approach was suggested in personal communication with Luke Tierney. 
It is also the approach taken by Robert (1993). 






This approximate Bayes factor is also closely related to likelihood ratio 
tests (see Section 4.5.5). 

Example 4.25 (Continuation of Example 4.22; see page 224). To maximize the
marginal density of the data under the alternative, we choose the prior
distribution to concentrate all of its probability on the values for $\sigma$ and $\mu$ which provide
a maximum for the likelihood function. These are clearly $\mu = \bar{x}$ and $\sigma = \sqrt{w/n}$.
Under the hypothesis, we must choose $\sigma$ to maximize the likelihood, and the
appropriate value is $\sigma = \sqrt{w/n + (\bar{x} - \mu_0)^2}$. This approximation corresponds to
letting $\lambda_0 = 0$, $b_0 = b_0^* = 0$, and $a_0 = a_0^* = 0$ in the analysis with the conjugate
prior. The approximate Bayes factor would then equal

$$\left(1 + \frac{t^2}{n-1}\right)^{-\frac{n}{2}}, \qquad (4.26)$$

where $t = \sqrt{n}(\bar{x} - \mu_0)/\sqrt{w/[n - 1]}$ is the usual $t$ statistic used to test $H : M = \mu_0$.

An alternative approximation to the Bayes factor is available by
approximating the marginal densities of the data under the alternative and
hypothesis using the method of Laplace. 7 That is, approximate the product of
likelihood times the prior by a multivariate normal density with the wrong
normalizing constant. Then approximate the integral over the parameter
space by the integral of a normal density. For example, suppose that
under the hypothesis, the parameter is $\Psi$ with prior density $f_\Psi(\psi)$ and that
under the alternative, the parameter is $\Theta$ with prior density $f_\Theta(\theta)$. The
likelihoods are $f_{X|\Psi}(x|\psi)$ and $f_{X|\Theta}(x|\theta)$, respectively. Assume that $\hat\psi$ and
$\hat\theta$ provide the largest values of the two likelihood functions. If the maxima
occur at points where the partial derivatives of the likelihoods are 0, and
if the likelihoods have continuous second partial derivatives, then we can
write

$$\log f_{X|\Psi}(x|\psi) \approx \log f_{X|\Psi}(x|\hat\psi) + \frac{1}{2}[\psi - \hat\psi]^\top A[\psi - \hat\psi],$$

where $A$ is the matrix of second partial derivatives of the logarithm of the
likelihood evaluated at $\hat\psi$. 8 This matrix will typically be negative definite.
Let $\sigma_\Psi = -A^{-1}$. A similar expression is obtained for $\theta$:

$$\log f_{X|\Theta}(x|\theta) \approx \log f_{X|\Theta}(x|\hat\theta) + \frac{1}{2}[\theta - \hat\theta]^\top B[\theta - \hat\theta].$$

Let $\sigma_\Theta = -B^{-1}$. If, in addition, the prior densities are relatively flat in the
regions where the likelihoods attain their largest values, we can write

$$\begin{aligned}
\int f_{X|\Psi}(x|\psi) f_\Psi(\psi)\,d\psi &\approx f_\Psi(\hat\psi)\,f_{X|\Psi}(x|\hat\psi) \int \exp\left(-\frac{1}{2}[\psi - \hat\psi]^\top \sigma_\Psi^{-1}[\psi - \hat\psi]\right) d\psi \\
&= f_\Psi(\hat\psi)\,f_{X|\Psi}(x|\hat\psi)\,(2\pi)^{\frac{k}{2}}|\sigma_\Psi|^{\frac{1}{2}},
\end{aligned}$$

where $k$ is the dimension of the vector $\psi$. A similar expression is obtained
for the integral over $\theta$, with $p$ denoting the dimension of $\theta$. The Bayes factor is

$$\frac{\int f_{X|\Psi}(x|\psi) f_\Psi(\psi)\,d\psi}{\int f_{X|\Theta}(x|\theta) f_\Theta(\theta)\,d\theta} \approx (2\pi)^{\frac{k-p}{2}}\,\frac{f_\Psi(\hat\psi)\,f_{X|\Psi}(x|\hat\psi)\,|\sigma_\Psi|^{\frac{1}{2}}}{f_\Theta(\hat\theta)\,f_{X|\Theta}(x|\hat\theta)\,|\sigma_\Theta|^{\frac{1}{2}}}. \qquad (4.27)$$

The factor $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ can be removed and multiplied times the prior
odds $p_0/[1 - p_0]$ to capture the prior input required. The rest of the
approximate Bayes factor does not require the specification of any prior
distributions. The removed factor, however, is not entirely prior-dependent.
It also depends on the observed data.

7 We will discuss the large sample properties of the method of Laplace in
Section 7.4.3. (In particular, see Theorem 7.116 and the ensuing discussion.)
Here, we give only a description of the method without any rigorous justification.
The derivation presented here is based on Kass and Raftery (1995).

8 The matrix $-A$ is sometimes called the observed Fisher information.

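Example 4.28 (Continuation of Example 4.22; see page 224). In the case of
testing $H : M = \mu_0$, we have $k = 1$ and $p = 2$ in (4.27) because $\Theta = (M, \Sigma)$ and
$\Psi = \Sigma$. The likelihood functions have their maxima at $\hat\psi = \sqrt{[w + n(\bar{x} - \mu_0)^2]/n}$
and $\hat\theta = (\bar{x}, \sqrt{w/n})$. The matrices $\sigma_\Psi$ and $\sigma_\Theta$ are

$$\sigma_\Psi = \frac{w + n(\bar{x} - \mu_0)^2}{2n^2}, \qquad \sigma_\Theta = \frac{w}{n^2}\begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{pmatrix}.$$

The approximate Bayes factor is $1/\sqrt{2\pi}$ times the factor $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ times the
expression in (4.26) times the ratio of the square roots of the determinants of the
two matrices above. The result is

$$\frac{f_\Psi(\hat\psi)\,n}{f_\Theta(\hat\theta)\sqrt{2\pi w}}\left(1 + \frac{t^2}{n-1}\right)^{-\frac{n-1}{2}}, \qquad (4.29)$$

where $t = \sqrt{n}(\bar{x} - \mu_0)/\sqrt{w/[n - 1]}$ is the usual $t$ statistic for testing $H$.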

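To see how the approximation compares with the actual Bayes factor, suppose
that the prior distributions are conjugate, as on page 224, and we let $f_\Psi$ be the
conditional prior calculated from $f_\Theta$ given that $H$ is true. Then

$$f_\Psi(\sigma) = \frac{2\left(\frac{b_0^*}{2}\right)^{\frac{a_0^*}{2}}}{\Gamma\left(\frac{a_0^*}{2}\right)}\,\sigma^{-(a_0^* + 1)}\exp\left(-\frac{b_0^*}{2\sigma^2}\right),$$

$$f_\Theta(\mu, \sigma) = \frac{2\left(\frac{b_0}{2}\right)^{\frac{a_0}{2}}\sqrt{\lambda_0}}{\sqrt{2\pi}\,\Gamma\left(\frac{a_0}{2}\right)}\,\sigma^{-(a_0 + 2)}\exp\left(-\frac{b_0 + \lambda_0(\mu - \mu^0)^2}{2\sigma^2}\right).$$

Plugging $\sigma = \hat\psi$ into the first of these and $(\mu, \sigma) = (\bar{x}, \sqrt{w/n})$ into the second
and taking the ratio give

$$\frac{f_\Psi(\hat\psi)}{f_\Theta(\hat\theta)} = \frac{\sqrt{\pi}\,(b_0^*)^{\frac{a_0^*}{2}}\,\Gamma\left(\frac{a_0}{2}\right)}{\sqrt{\lambda_0}\,(b_0)^{\frac{a_0}{2}}\,\Gamma\left(\frac{a_0^*}{2}\right)}\left(\frac{w}{n\hat\psi^2}\right)^{\frac{a_0 + 2}{2}}\exp\left(\frac{n[b_0 + \lambda_0(\bar{x} - \mu^0)^2]}{2w} - \frac{b_0^*}{2\hat\psi^2}\right).$$

If $n$ is large, the exponential term above can be approximated by

$$\left(1 + \frac{b_0 + \frac{n\lambda_0}{n + \lambda_0}(\bar{x} - \mu^0)^2}{w}\right)^{\frac{a_0 + n}{2}}\left(1 + \frac{b_0 + \lambda_0(\mu_0 - \mu^0)^2}{n\hat\psi^2}\right)^{-\frac{a_0^* + n}{2}}.$$

If we substitute this into (4.29) and notice that $1 + t^2/[n - 1] = \hat\psi^2/(w/n)$, we
get

$$\frac{n\,(b_0^*)^{\frac{a_0^*}{2}}\,\Gamma\left(\frac{a_0}{2}\right)(b_1)^{\frac{a_1}{2}}}{(b_0)^{\frac{a_0}{2}}\sqrt{2\lambda_0}\,\Gamma\left(\frac{a_0^*}{2}\right)(b_1^*)^{\frac{a_1^*}{2}}},$$

where $b_1$, $b_1^*$, $a_1$, and $a_1^*$ are defined after (4.23). If $\lambda_0$ and $a_0$ are small relative
to $n$, we can approximate $n/\sqrt{2}$ by $\sqrt{\lambda_0 + n}\,\Gamma(a_1^*/2)/\Gamma(a_1/2)$. With this
approximation, the expression above becomes exactly (4.23). Although $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$
depends on the data, one could calculate values of the ratio for a range of
plausible priors to see how much it could reasonably vary.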

As an example, suppose that $\mu_0 = 1.5$ and that $n = 14$, $\bar{x} = 2.7$, and $w = 41$
are observed. Then $\hat\psi = 2.0901$ and $\hat\theta = (2.7, 1.7113)$. That portion of (4.29)
that does not depend on the prior is

$$\frac{n}{\sqrt{2\pi w}}\left(1 + \frac{t^2}{n-1}\right)^{-\frac{n-1}{2}} = 0.0648.$$

Next, we let $\mu^0 = 1.5$ and let the other hyperparameters $a_0, b_0, \lambda_0$ be elements
of the set $\{0.1, 1, 5, 10, 20\}$. Figure 4.30 shows the 125 different values of the
logarithm of the ratio $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ with $\lambda_0$ varying most rapidly and $a_0$ varying
most slowly. Since $\log(0.0648) = -2.736$, those priors corresponding to values on
the vertical axis greater than 2.736 (horizontal line) will lead to Bayes factors
greater than 1, while the others lead to Bayes factors less than 1. Examining
Figure 4.30, we see that many reasonable priors (those with small to moderate
values of $a_0$ and $\lambda_0$ and values of $b_0/a_0$ in the vicinity of the observed sample
variance $w/n = 2.93$) give values for the log of the ratio near 2.736. This suggests
that the data will not dramatically alter anyone's opinion as to whether
or not $M = 1.5$. The other approximate Bayes factor (4.26) is 0.0608, which
suggests a significant reduction to the odds in favor of the hypothesis. The $t$
statistic for the usual classical test would be 2.53, and the hypothesis would be
rejected at level 0.05.
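
The numbers in this example can be reproduced directly from the summary statistics. The sketch below is illustrative only (the variable names are not from the text); it assumes the maximum likelihood estimates $\hat\psi = \sqrt{[w + n(\bar{x} - \mu_0)^2]/n}$ and $\hat\sigma = \sqrt{w/n}$, the approximate Bayes factor $(1 + t^2/[n-1])^{-n/2}$, and the prior-free factor $n(1 + t^2/[n-1])^{-(n-1)/2}/\sqrt{2\pi w}$ of the Laplace approximation.

```python
import math

# Summary statistics quoted in the example.
n, xbar, w, mu0 = 14, 2.7, 41.0, 1.5

# Maximum likelihood estimates under the hypothesis and the alternative.
psi_hat = math.sqrt((w + n * (xbar - mu0) ** 2) / n)
sigma_hat = math.sqrt(w / n)

# Usual t statistic for testing M = mu0.
t_stat = math.sqrt(n) * (xbar - mu0) / math.sqrt(w / (n - 1))
ratio = 1.0 + t_stat ** 2 / (n - 1)

# Likelihood-maximizing approximation to the Bayes factor.
approx_bayes_factor = ratio ** (-n / 2.0)

# Prior-free portion of the Laplace approximation.
laplace_data_part = n / math.sqrt(2.0 * math.pi * w) * ratio ** (-(n - 1) / 2.0)

if __name__ == "__main__":
    print(round(psi_hat, 4), round(sigma_hat, 4), round(t_stat, 2))
    print(round(approx_bayes_factor, 4), round(laplace_data_part, 4))
    print(round(math.log(laplace_data_part), 3))
```

Multiplying `laplace_data_part` by a chosen value of the prior density ratio $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ gives the full Laplace approximation for that prior.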

[Figure 4.30. Logarithms of Ratios of Prior Densities. Horizontal axis: Prior Sequence, 0 to 120.]

An interesting difference between Bayes factors and the results of classical
hypothesis tests arises from the comparison of the various lower bounds on
the Bayes factor to the significance probability (see Definition 4.8). We
will generalize the concept of significance probability in Section 4.6 and
then make detailed comparisons with Bayes factors. At this point, we only
mention that the results can often be in stark contrast. In particular, data
with very small significance probability (supposedly suggesting that the
data do not support the hypothesis) can have relatively large values for the
lower bounds on the Bayes factor (suggesting that the data do not conflict
with the hypothesis to a great extent). This conflict is sometimes called
"Lindley's paradox" [see Lindley (1957) and Jeffreys (1961)].

If one believes that the probability is 0 that the parameter lies in a low- 
dimensional subset of the parameter space, then it is not appropriate to 
test the types of hypotheses we have considered in this section. As an alter- 
native, one can calculate a measure of how far the parameter of interest is 
from the hypothesized low-dimensional subset. For example, if the
parameter is $\Theta = (M, \Sigma)$ and the hypothesis is that $M$ is near 0, then the posterior
distribution of $|M|$ contains a great deal of information about how far $M$ is
from 0. Also, the posterior distribution of $|M|/\Sigma$ contains information of a
similar sort.

Example 4.31. Suppose that $X \sim N(\theta, 1)$, given $\Theta = \theta$, and that our hypothesis
is that $\Theta$ is near $\theta_0$. If we consider all prior distributions of the form $\Theta \sim N(\theta_0, \tau^2)$,
then the posterior distribution of $\Theta$ is $N([\theta_0 + \tau^2 x]/[1 + \tau^2], \tau^2/[1 + \tau^2])$. For each
$\delta > 0$, we can calculate

$$\Pr(|\Theta - \theta_0| \le \delta \,|\, X = x) = \Phi\left(\lambda[\theta_0 - x] + \frac{\delta}{\lambda}\right) - \Phi\left(\lambda[\theta_0 - x] - \frac{\delta}{\lambda}\right),$$

where $\lambda = \tau/\sqrt{1 + \tau^2}$, and $\Phi$ is the standard normal CDF. There is no useful
upper bound on this probability, but a lower bound is obtained by letting $\lambda \to 1$,
which means $\tau^2 \to \infty$. This corresponds to the usual improper prior. After
observing $X = x$, one could plot $\Pr(|\Theta - \theta_0| \le \delta)$ (using an improper prior) as
a function of $\delta$ to describe how far $\Theta$ is likely to be from $\theta_0$. For example, if
$x = \theta_0 + 1.96$, then $\Pr(|\Theta - \theta_0| \le \delta) = 0.05$ for $\delta = 0.3983$.
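
The value $\delta = 0.3983$ can be recovered by solving for $\delta$ numerically. The sketch below is an illustration under the stated normal model; the function names are hypothetical, `lam` plays the role of $\lambda = \tau/\sqrt{1 + \tau^2}$, and `lam = 1.0` corresponds to the improper-prior limit.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_near(delta: float, x_minus_theta0: float, lam: float = 1.0) -> float:
    """Posterior probability that |Theta - theta0| <= delta."""
    shift = lam * x_minus_theta0
    return std_normal_cdf(delta / lam - shift) - std_normal_cdf(-delta / lam - shift)


def delta_for_probability(target: float, x_minus_theta0: float, lam: float = 1.0) -> float:
    """Solve prob_near(delta) = target for delta by bisection (0 < target < 1)."""
    lo, hi = 0.0, 1.0
    while prob_near(hi, x_minus_theta0, lam) < target:
        hi *= 2.0  # expand the bracket until it contains the solution
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if prob_near(mid, x_minus_theta0, lam) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(round(delta_for_probability(0.05, 1.96), 4))
```

Evaluating `prob_near` over a grid of `delta` values gives the curve that the example suggests plotting.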

For multiparameter problems, there may be many possible summaries of 
the parameters that measure the extent to which the parameter differs from 
the hypothesis. We will consider a very general class of such summaries in 
Section 8.2.3. 



4.3 Most Powerful Tests 

As we saw earlier, the power function of a test is closely associated with 
the risk function for a 0-1-c loss. It makes sense, then, that most attention 
in classical hypothesis testing focuses on the power function. The following 
definitions begin to introduce the criteria by which tests are evaluated in 
the classical framework. 

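Definition 4.32. Suppose that $\Omega = \Omega_H \cup \{\theta_1\}$, where $\theta_1 \notin \Omega_H$. A level $\alpha$
test $\phi$ of $H : \Theta \in \Omega_H$ versus $A : \Theta = \theta_1$ is called most powerful (MP) level
$\alpha$ if, for every level $\alpha$ test $\psi$, $\beta_\psi(\theta_1) \le \beta_\phi(\theta_1)$.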

The corresponding dual criterion is the following. 

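Definition 4.33. Suppose that $\Omega = \Omega_A \cup \{\theta_0\}$, where $\theta_0 \notin \Omega_A$. A floor $\alpha$
test $\phi$ of $H : \Theta = \theta_0$ versus $A : \Theta \in \Omega_A$ is called most cautious (MC) floor
$\alpha$ if, for every floor $\alpha$ test $\psi$, $\beta_\psi(\theta_0) \ge \beta_\phi(\theta_0)$.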

For more general cases, we have the following definitions. 

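Definition 4.34. A level $\alpha$ test $\phi$ is uniformly most powerful (UMP) level
$\alpha$ if, for every other level $\alpha$ test $\psi$, $\beta_\psi(\theta) \le \beta_\phi(\theta)$ for all $\theta \in \Omega_A$. A floor $\alpha$
test $\phi$ is uniformly most cautious (UMC) floor $\alpha$ if, for every other floor $\alpha$
test $\psi$, $\beta_\psi(\theta) \ge \beta_\phi(\theta)$ for all $\theta \in \Omega_H$.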

In some cases, both criteria (UMP and UMC) lead to the same optimal 
tests. In some cases, they do not. (See Problem 31 on page 289.) Either 
way, there is asymmetry in these definitions. A different criterion is used for 
protecting against one type of error than that used for protecting against 
the other. One argument given for the particular choice is that type I error 
is more costly than type II error, so we arrange for the maximum type 
I error probability to be small. However, what often happens is that the 
probability of type II error can become even smaller for most values of the 
parameter. Here is a simple example. 

Example 4.35. Suppose that $X \sim Poi(\theta)$ given $\Theta = \theta$, and that $\Omega = \{1, 10\}$.
We are interested in testing $H : \Theta = 1$ versus $A : \Theta = 10$. The MP level 0.05 test
is (see Proposition 4.37 ahead)

$$\phi(x) = \begin{cases} 0 & \text{if } x \le 2, \\ 0.5058 & \text{if } x = 3, \\ 1 & \text{if } x \ge 4. \end{cases}$$

The probability of type II error for this test is 0.0065, which is much smaller than 
the probability of type I error. 

In Example 4.35, we protect ourselves more against the less costly error 
than against the more costly error. If type I error is more costly, it might 
make sense to minimize it using the UMC criterion and let the probability 
of type II error be a bit larger. 

Example 4.36 (Continuation of Example 4.35; see page 230). The MC floor
0.05 test of $H : \Theta = 1$ versus $A : \Theta = 10$ is

$$\phi(x) = \begin{cases} 0 & \text{if } x \le 4, \\ 0.4516 & \text{if } x = 5, \\ 1 & \text{if } x \ge 6. \end{cases}$$

The probability of type I error for this test is 0.00198. This test provides more 
protection against the more costly error, while keeping the probability of the 
other error at a low level. 
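
The error probabilities quoted in Examples 4.35 and 4.36 can be reproduced with the following sketch. It is only an illustration: the helper names `poisson_pmf` and `rejection_prob` are hypothetical, and the cutoffs and randomization probabilities are the ones stated in the two examples.

```python
from math import exp, factorial

def poisson_pmf(x, theta):
    """Poisson probability mass function at the integer x."""
    return exp(-theta) * theta**x / factorial(x)

def rejection_prob(theta, cutoff, gamma):
    """Probability of rejecting H under theta for the randomized test that
    rejects when x > cutoff and rejects with probability gamma when x == cutoff."""
    prob_at_most_cutoff = sum(poisson_pmf(x, theta) for x in range(cutoff + 1))
    return (1.0 - prob_at_most_cutoff) + gamma * poisson_pmf(cutoff, theta)

# Test of Example 4.35: randomize at x = 3 with probability 0.5058.
type1_mp = rejection_prob(1, 3, 0.5058)
type2_mp = 1.0 - rejection_prob(10, 3, 0.5058)

# Test of Example 4.36: randomize at x = 5 with probability 0.4516.
type1_mc = rejection_prob(1, 5, 0.4516)
type2_mc = 1.0 - rejection_prob(10, 5, 0.4516)

print(type1_mp, type2_mp)
print(type1_mc, type2_mc)
```

The first test spends its whole 0.05 allowance on type I error, while the second has type II error probability close to 0.05 and a much smaller type I error probability.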

Another alternative is to try to balance the costs of the two types of error 
in a deliberate fashion. For example, Lehmann (1958) offered the suggestion 
that one decrease the required level as the sample size increases so that the 
power would decrease also. Schervish (1983) suggested that the size of the 
test be matched to the power function at an alternative chosen based on 
substantive grounds. 

Not many theorems can be proven about MP and UMP tests in general 
without some assumptions of additional structure. One general result is al- 
ready familiar to us. The Neyman-Pearson lemma 3.87 provided a minimal 
complete class for this decision problem. For convenience, we restate that 
result here using the language of hypothesis testing. 

Proposition 4.37 (Neyman-Pearson fundamental lemma). Let $\Omega = \{\theta_0, \theta_1\}$
and let $P_\theta \ll \nu$, for some measure $\nu$ and both values of $\theta$. Let
$f_i(x) = dP_{\theta_i}/d\nu(x)$ for $i = 0, 1$. Let $H : \Theta = \theta_0$ and $A : \Theta = \theta_1$.

For each $k \in (0, \infty)$ and each function $\gamma : \mathcal{X} \to [0, 1]$, define the test

$$\phi_{k,\gamma}(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x), \\ \gamma(x) & \text{if } f_1(x) = k f_0(x), \\ 0 & \text{if } f_1(x) < k f_0(x). \end{cases}$$

Also define the two tests

$$\phi_0(x) = \begin{cases} 1 & \text{if } f_1(x) > 0, \\ 0 & \text{if } f_1(x) = 0, \end{cases} \qquad \phi_\infty(x) = \begin{cases} 1 & \text{if } f_0(x) = 0, \\ 0 & \text{if } f_0(x) > 0. \end{cases}$$

All of these tests are MP of their respective levels and MC of their respective
floors.

Note that $\phi_\infty$ will have size 0 because it never rejects $H$ when $f_0 > 0$. On
the other hand, $\phi_0$ will have the largest possible size for an admissible test,
equal to 1 in many problems but not always.

The following result gives conditions under which MP tests are essentially 
unique. 

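Lemma 4.38. Let $\Omega = \{\theta_0, \theta_1\}$ and let $P_\theta \ll \nu$, for some measure $\nu$ and
both values of $\theta$. Let $f_i(x) = dP_{\theta_i}/d\nu(x)$ for $i = 0, 1$, and let

$$B_k = \{x : f_1(x) = k f_0(x)\}.$$

Suppose that for all $k \in [0, \infty]$, $P_{\theta_i}(B_k) = 0$ for $i = 0, 1$. Let $\phi$ be a test of
$H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$ of the form

$$\phi(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x), \\ 0 & \text{otherwise,} \end{cases}$$

and let $\psi$ be another test such that $\beta_\psi(\theta_0) = \beta_\phi(\theta_0)$. Then either $\phi = \psi$,
a.s. $[P_{\theta_i}]$ for $i = 0, 1$, or $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$.

Proof. Let $\phi$ and $\psi$ be as stated in the lemma. Define $A_> = \{x : \psi(x) > \phi(x)\}$
and $A_< = \{x : \psi(x) < \phi(x)\}$. Clearly, $\phi$ is in the form of a test from
Proposition 4.37, and so it is MP of its size. Also, since $\phi$ only takes on the
values 0 and 1, we have

$$A_> \subseteq \{x : \phi(x) = 0\} = \{x : f_1(x) \le k f_0(x)\},$$
$$A_< \subseteq \{x : \phi(x) = 1\} = \{x : f_1(x) > k f_0(x)\}.$$

It follows that

$$\int_{A_>} [\psi(x) - \phi(x)] f_1(x)\, d\nu(x) \le k \int_{A_>} [\psi(x) - \phi(x)] f_0(x)\, d\nu(x),$$
$$\int_{A_<} [\psi(x) - \phi(x)] f_1(x)\, d\nu(x) \le k \int_{A_<} [\psi(x) - \phi(x)] f_0(x)\, d\nu(x). \qquad (4.39)$$

Because of the way $A_>$ and $A_<$ are defined, the sum of the two left-hand
sides in (4.39) is $\beta_\psi(\theta_1) - \beta_\phi(\theta_1)$, and the sum of the two right-hand sides
is $k[\beta_\psi(\theta_0) - \beta_\phi(\theta_0)] = 0$. Now, assume that $P_{\theta_1}(\phi(X) = \psi(X)) < 1$. It
follows that $P_{\theta_1}(A_> \cup A_<) > 0$. Since $P_{\theta_1}(B_k) = 0$, we have that at least one
of the inequalities in (4.39) is strict. In this case $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$. Finally,
assume that $P_{\theta_0}(\phi(X) = \psi(X)) < 1$. It follows that either $P_{\theta_0}(A_>) > 0$
or $P_{\theta_0}(A_<) > 0$. In the latter case, $P_{\theta_1}(A_<) > 0$, and we have just proven
that $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$. If $P_{\theta_0}(A_>) > 0$ and $P_{\theta_0}(A_<) = 0$, then $\psi \ge \phi$ with
strict inequality on a set of positive $P_{\theta_0}$ probability, which would contradict
$\beta_\psi(\theta_0) = \beta_\phi(\theta_0)$. □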

Example 4.40. Let $X \sim N(\theta, 1)$ given $\Theta = \theta$, and let $\Omega = \{\theta_0, \theta_1\}$. Then $B_k$ is
a singleton set for every $k$ and $P_{\theta_i}(B_k) = 0$ for every $k$ and $i = 0, 1$.

Example 4.41. Let $X \sim U(0, \theta)$ given $\Theta = \theta$, and let $\Omega = \{\theta_0, \theta_1\}$. If $k = \theta_0/\theta_1$,
then $B_k$ is a set with positive probability under both $P_{\theta_0}$ and $P_{\theta_1}$. So, the
conditions of Lemma 4.38 are not met in this example.



4.3.1 Simple Hypotheses and Alternatives 

In a simple-simple testing problem, the parameter space has only two 
points in it, and so the risk function has only two values. This makes it 
particularly easy to compare the risk functions of all tests at once. Each 
test corresponds to a point in two-dimensional space. Each coordinate is 
the risk function evaluated at one of the parameter values. For definiteness,
let $\Omega = \{\theta_0, \theta_1\}$ and let $\Omega_H = \{\theta_0\}$, so that $\Omega_A = \{\theta_1\}$. Let $\alpha_0$ stand for
$\beta_\phi(\theta_0)$ and $\alpha_1$ for $1 - \beta_\phi(\theta_1)$ for an arbitrary test $\phi$. Then the risk function
of $\phi$ is represented by the point $(\alpha_0, \alpha_1) \in [0, 1]^2$. The risk set, as defined
in Definition 3.71, is the set of all possible $(\alpha_0, \alpha_1)$ points.

Example 4.42. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Omega = \{0, 1\}$. According
to the Neyman-Pearson fundamental lemma 4.37, the MP tests of $H : \Theta = 0$
versus $A : \Theta = 1$ are those that reject $H$ when $\exp(-[x - 1]^2/2) > k \exp(-x^2/2)$
for various values of $k \in [0, \infty]$. This inequality simplifies to $x > c$ for arbitrary
$c \in [-\infty, \infty]$. For each $c$, we get a point in the risk set with

$$\alpha_0 = P_0(X > c) = 1 - \Phi(c) = \Phi(-c),$$
$$\alpha_1 = P_1(X \le c) = \Phi(c - 1).$$

A plot of these points is given in Figure 4.44. 
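
As a rough illustration, the points traced out by the tests of Example 4.42 can be tabulated as follows. The function name `risk_point` and the grid of cutoffs are assumptions made for this sketch only.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def risk_point(c):
    """Return (alpha0, alpha1) for the test that rejects H: Theta = 0 when x > c."""
    alpha0 = PHI(-c)        # type I error probability, P_0(X > c)
    alpha1 = PHI(c - 1.0)   # type II error probability, P_1(X <= c)
    return alpha0, alpha1

# Points produced by the Neyman-Pearson tests for a grid of cutoffs c.
cutoffs = [i / 2 for i in range(-6, 9)]
np_points = [risk_point(c) for c in cutoffs]

# Each test phi has a mirror image 1 - phi, which gives the reflected point.
reflected_points = [(1 - a0, 1 - a1) for a0, a1 in np_points]

for c, (a0, a1) in zip(cutoffs, np_points):
    print(f"c = {c:5.1f}   alpha0 = {a0:.4f}   alpha1 = {a1:.4f}")
```

Every point in `np_points` lies on or below the line $\alpha_1 = 1 - \alpha_0$, and the cutoff $c = 1/2$ gives equal error probabilities.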
Figure 4.44 has several features that are typical of all risk sets. 

Lemma 4.43. The risk set for a simple-simple hypothesis-testing problem
is closed, convex, and symmetric about the point (1/2, 1/2). It also contains
that portion of the line $\alpha_1 = 1 - \alpha_0$ lying in the unit square.

Proof. The rule $\phi(x) \equiv \alpha_0$ corresponds to the point $(\alpha_0, 1 - \alpha_0)$ for each $\alpha_0$
between 0 and 1, so the risk set contains the portion of the line $\alpha_1 = 1 - \alpha_0$
which lies in the unit square.

Suppose that $(a, b)$ is in the risk set. The symmetrically placed point
about (.5, .5) is $(1 - a, 1 - b)$. If $\phi$ produces the first point, then $1 - \phi$
produces the second point. Hence the risk set is symmetric about (.5, .5).
Lemma 3.74 shows that the risk set is convex.

To show that the risk set is closed, we need only show that it contains
its lower boundary. The rest of the boundary is included by symmetry and
convexity (and the fact that the line $\alpha_1 = 1 - \alpha_0$ is in). By Proposition 3.91,
a Bayes rule exists for every prior. By Lemma 3.96, every point on $\partial_L$ is
the risk function for one of these Bayes rules; hence $\partial_L$ is in the risk set. □



Figure 4.44. Risk Set for Testing $H : \Theta = 0$ versus $A : \Theta = 1$ with $X \sim N(\theta, 1)$



The definition of the lower boundary $\partial_L$ of the risk set (see Definition 3.71)
is designed so that exactly those tests which produce points
on $\partial_L$ are admissible. The lower boundary also consists solely of Bayes
rules. (See Lemma 3.96.) Proposition 3.91 tells us that if a Bayes rule with
respect to some prior is not in $\partial_L$, then one of the two prior probabilities is
0 and there is another Bayes rule with respect to that prior that is in $\partial_L$.
These considerations lead to the following result.

Lemma 4.45.$^{9}$ If a test $\phi$ is MP level $\alpha$ for testing $H : \Theta = \theta_0$ versus
$A : \Theta = \theta_1$, then either $\beta_\phi(\theta_1) = 1$ or $\beta_\phi(\theta_0) = \alpha$.

Proof. We will prove the contrapositive. Suppose that both $\beta_\phi(\theta_1) < 1$
and $\beta_\phi(\theta_0) < \alpha$. Create another test $\phi'$ as follows. Let $A = \{x : \phi(x) < 1\}$.
Note that $P_{\theta_1}(A) > 0$, since $\beta_\phi(\theta_1) < 1$. Let $g_c(x) = \min\{c, 1 - \phi(x)\}$. Then,
for $c \ge 0$, $h_i(c) = E_{\theta_i} g_c(X) \ge 0$ and $h_i(c)$ is nondecreasing in $c$. Also, it is
easy to see that $h_i(c)$ is continuous in $c$, since $|h_i(c) - h_i(d)| \le |c - d|$. It
follows that there exists $c$ such that $h_0(c) = \alpha - \beta_\phi(\theta_0)$. Let $\phi' = \phi + g_c$. It
follows that $\beta_{\phi'}(\theta_0) = \alpha$. Also, $\phi'(x) > \phi(x)$ for all $x \in A$. Since $P_{\theta_1}(A) > 0$,
it follows that $h_1(c) > 0$ and $\beta_{\phi'}(\theta_1) > \beta_\phi(\theta_1)$. So $\phi$ is not MP level $\alpha$. □

Lemma 4.45 says that a test that is MP level $\alpha$ must have size $\alpha$ unless
all tests with size $\alpha$ are inadmissible. This result allows us to say when the
two optimality criteria (MP and MC) are equivalent in the simple-simple
testing situation.



9 This lemma is used in the proofs of Lemmas 4.47 and 4.103. 






Proposition 4.46. Suppose that $\alpha_0$ and $\alpha_1$ are both strictly between 0 and
1. Suppose that a test $\phi$ of $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$ corresponds to
the point $(\alpha_0, \alpha_1)$ in the risk set. Then $\phi$ is MC floor $1 - \alpha_1$ if and only if
it is MP level $\alpha_0$.

The reason for the restriction that $\alpha_0$ and $\alpha_1$ be strictly between 0 and
1 is that many tests with $\alpha_0 \in \{0, 1\}$ or $\alpha_1 \in \{0, 1\}$ are inadmissible even
though they may satisfy one of the two optimality criteria.

The following lemma allows us to conclude that if $\phi$ is an MP level $\alpha$ test
and we switch the names of hypothesis and alternative, then $1 - \phi$ becomes
an MC floor $1 - \alpha$ test in the new problem. (See Problem 30 on page 289
for a more general version.)

Lemma 4.47. If $\phi$ is MP level $\alpha$ for testing $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$,
then $1 - \phi$ has the smallest power at $\theta_1$ among all tests with size at least
$1 - \alpha$.

Proof. First, note that $1 - \phi$ has size at least $1 - \alpha$. Next, suppose that
$\beta_\phi(\theta_1) = 1$. Then $1 - \phi$ has power 0 at $\theta_1$ and is clearly the least powerful
of any class to which it belongs. By Lemma 4.45, the only other case to
consider is that in which $\beta_\phi(\theta_0) = \alpha$. In this case, $1 - \phi$ has level $1 - \alpha$.
Suppose, to the contrary, that $\beta_\psi(\theta_0) \ge 1 - \alpha$ and $\beta_\psi(\theta_1) < \beta_{1-\phi}(\theta_1)$. Then
$\beta_{1-\psi}(\theta_0) \le \alpha$ and $\beta_{1-\psi}(\theta_1) > \beta_\phi(\theta_1)$, which contradicts the assumption
that $\phi$ is MP level $\alpha$. □

The following lemma says that in the comparison of two MP Neyman- 
Pearson tests, the one with the smaller level will also have the smaller 
power. 

Lemma 4.48.$^{10}$ Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family. If $\phi_1$ is a level
$\alpha_1$ test of the form of the Neyman-Pearson fundamental lemma 4.37 for
testing $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$, and if $\phi_2$ is a level $\alpha_2$ test of that
form with $\alpha_1 < \alpha_2$, then $\beta_{\phi_1}(\theta_1) \le \beta_{\phi_2}(\theta_1)$.

Proof. By the Neyman-Pearson fundamental lemma 3.87, both $\phi_1$ and
$\phi_2$ are admissible. If $\beta_{\phi_1}(\theta_1) > \beta_{\phi_2}(\theta_1)$, then $\phi_2$ is inadmissible. □
The Neyman-Pearson fundamental lemma 4.37 tells us all of the admissible
MP and MC tests. We also saw (Theorem 3.95 and Lemma 3.96) that
these are the tests corresponding to points on $\partial_L$, the lower boundary of
the risk set, and they are the Bayes rules with respect to positive priors and
one Bayes rule for each of the priors that assign 0 probability to one of the
parameter values. The usual classical approach to choosing one of the admissible
tests is not to choose a prior distribution and then take the Bayes
rule, but rather to choose a value of $\alpha$ and then choose the MP level $\alpha$ test.
In cases with simple hypotheses and simple alternatives, the classical and
Bayesian procedures will agree. That is, for every prior distribution, there
is a formal Bayes rule and $\alpha$ such that the formal Bayes rule is MP level
$\alpha$. Similarly, for every $\alpha$, there is a prior such that the MP level $\alpha$ test is
a formal Bayes rule. Only in cases more complicated than those described
so far can we distinguish these two approaches. Example 4.49 is one such
case.

10 This lemma is used in the proof of Theorem 4.56.

Example 4.49. Suppose that $X_1$ and $X_2$ are conditionally independent $U(0, \theta)$
random variables given $\Theta = \theta$ and $\Omega = \{1, 2\}$. Let $\Omega_H = \{1\}$ and $\Omega_A = \{2\}$.
Suppose that $Z$ is independent of $X_1$ and $X_2$ given $\Theta$ with distribution $Ber(1/2)$.
Hence $Z$ is ancillary. Suppose that we observe $Z$ and $X = \max_{1 \le i \le n} X_i$, where
$n = 1$ if $Z = 0$ and $n = 2$ if $Z = 1$. That is, the sample size is random ($n = Z + 1$)
but ancillary. The marginal densities of $X$ (given $\Theta = 1, 2$) are

$$f_1(x) = \frac{1}{2} + x \quad \text{if } 0 < x < 1,$$
$$f_2(x) = \frac{1}{4} + \frac{x}{4} \quad \text{if } 0 < x < 2.$$

The MP level $\alpha$ test based solely on $X$ is$^{11}$ $\phi(x) = 1$ if $f_2(x)/f_1(x) > c$ for some
$c$. This becomes

$$\frac{1 + x}{2 + 4x} > c, \quad \text{or } x > 1.$$

For $1/3 < c < 1/2$, the first inequality is $x < (1/4 - c/2)/(c - 1/4)$. For $c \le 1/3$,
the inequality is $x > 0$, and for $c \ge 1/2$, the inequality is $x > 1$. So, the MP level
$\alpha$ test for $0 < \alpha < 1$ is $\phi(x) = 1$ if $x > 1$ or $x < (\sqrt{1 + 8\alpha} - 1)/2$. The power of
this test is $\beta_\phi(2) = (9 + 4\alpha + \sqrt{1 + 8\alpha})/16$.

This test may seem odd because it picks $\Theta = 2$ for small values of $x$. An
alternative test is formed by conditioning on the ancillary $Z$. If $Z = 0$ ($n = 1$),
the MP level $\alpha$ conditional test is $\psi(x) = 1$ if $x > 1 - \alpha$. Actually, we could have
chosen $x > 1$ together with any interval of length $\alpha$, but the test is easier to write
if we put the interval next to $x > 1$. Similarly, if $Z = 1$ ($n = 2$), then the MP level
$\alpha$ conditional test is $\psi(x) = 1$ if $x > \sqrt{1 - \alpha}$. (Once again, the ratio of densities is
constant for all $x < 1$, so we could have chosen any set with probability $\alpha$ given
$\Theta = 1$.) The power of this conditional test can be calculated to equal $(5 + 3\alpha)/8$,
which is always smaller than the power of the test $\phi$. So the test conditional on
the ancillary is inadmissible.
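
The size and power formulas in Example 4.49 can be checked by integrating the two marginal densities directly. The sketch below is illustrative only; the function names `unconditional_test` and `conditional_test` are hypothetical.

```python
from math import sqrt

def unconditional_test(alpha):
    """Size and power of the test that rejects when x > 1 or x < t,
    where t = (sqrt(1 + 8*alpha) - 1) / 2."""
    t = (sqrt(1 + 8 * alpha) - 1) / 2
    # Integral of f1(x) = 1/2 + x over (0, t): probability of rejecting under Theta = 1.
    size = t / 2 + t**2 / 2
    # P(X > 1 | Theta = 2) = 5/8, plus the integral of f2(x) = 1/4 + x/4 over (0, t).
    power = 5 / 8 + t / 4 + t**2 / 8
    return size, power

def conditional_test(alpha):
    """Size and power of the test with conditional size alpha given Z."""
    s = sqrt(1 - alpha)
    # Z = 0 (n = 1): reject when x > 1 - alpha.  Z = 1 (n = 2): reject when x > s.
    size = 0.5 * alpha + 0.5 * (1 - s**2)
    power = 0.5 * (1 + alpha) / 2 + 0.5 * (1 - s**2 / 4)
    return size, power

for alpha in (0.01, 0.05, 0.25):
    size_u, power_u = unconditional_test(alpha)
    size_c, power_c = conditional_test(alpha)
    print(alpha, round(power_u, 5), round(power_c, 5))
```

For every $\alpha$ strictly between 0 and 1 the first power exceeds the second, because $\sqrt{1 + 8\alpha} > 1 + 2\alpha$ on that range.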

The fact that the MP level a test conditional on the ancillary is inadmissible 
has led some [e.g., Bondar (1988)] to conclude that we should not condition on 
the ancillary in such problems. This mistaken conclusion is due to the fact that 
the conditional test being compared has conditional size a given all values of 
the ancillary. The lesson should be that we should not fix the size to be the 
same for all values of the ancillary, but we should continue to condition on the 
ancillary. If we have a prior $(\pi_1, \pi_2)$ with $\pi_1 + \pi_2 = 1$, then the Bayes rule$^{12}$ will
be drastically different depending on whether $Z = 0$ or $Z = 1$ is observed. A
complete characterization of the Bayes rules $\phi_{\pi_1}$ for all values of $\pi_1$ is as follows.
We always have $\phi_{\pi_1}(x) = 1$ for $x > 1$. For $0 < x < 1$, the value of $\phi_{\pi_1}(x)$ is

    $\pi_1$                      $Z = 0$      $Z = 1$
    $< 1/5$                      1            1
    $1/5$                        1            arbitrary
    between $1/5$ and $1/3$      1            0
    $1/3$                        arbitrary    0
    $> 1/3$                      0            0

11 This test is not the MP level $\alpha$ test based on the joint distribution of the
data $(X, Z)$.

12 These Bayes rules will be the MP level $\alpha$ tests based on the joint distribution
of $(X, Z)$ according to the Neyman-Pearson fundamental lemma 4.37.



Notice that the Bayes rule is never the same as either the unconditional level $\alpha$
test or the conditional level $\alpha$ test when $0 < \alpha < 1$. That is, there is no value of
$\pi_1$ strictly between 0 and 1 such that the Bayes rule rejects $H$ for some (but not
all) $0 < x < 1$ for both $Z = 0$ and $Z = 1$. The Bayes rule either rejects $H$ for all
values of $0 < x < 1$ for at least one of the $Z$ values or it rejects $H$ for no values
of $0 < x < 1$ for at least one of the $Z$ values. The power functions of the Bayes
rules are

    $\pi_1$                      $\beta_{\phi_{\pi_1}}(2)$
    $< 1/5$                      $1$
    $1/5$                        $(7 + a)/8$
    between $1/5$ and $1/3$      $7/8$
    $1/3$                        $(5 + 2a)/8$
    $> 1/3$                      $5/8$



Here $a$ is any number between 0 and 1 corresponding to the "arbitrary" parts of
some of the Bayes rules. Note that for each $\alpha$ between 0 and 1, there is a Bayes
rule with size $\alpha$. (For example, to get $\alpha = 0.05$, let $a = 0.1$ in the fourth row of
the table. One such test is the following. If $Z = 1$ is observed, $\phi_{1/3}(x) = 1$ for
$x > 1$, and if $Z = 0$ is observed, $\phi_{1/3}(x) = 1$ for $x > 0.9$.) Since the Bayes rule
with size $\alpha$ is the MP level $\alpha$ test, it has higher power than the unconditional
size $\alpha$ test. (See Problem 15 on page 287.) On the other hand, it does not have
conditional level $\alpha$ given the ancillary.
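
The Bayes rules of Example 4.49 can also be evaluated numerically. The sketch below rests on two facts from the example: given $Z = 0$ the likelihood ratio $f_2/f_1$ on $0 < x < 1$ is $1/2$, and given $Z = 1$ it is $1/4$, so the rule rejects on $(0, 1)$ when $\pi_1 < 1/3$ or $\pi_1 < 1/5$, respectively. The function name `bayes_rule_size_and_power` is hypothetical, and `a` stands for the conditional probability, given $\Theta = 1$, of the "arbitrary" rejection region.

```python
from fractions import Fraction

def bayes_rule_size_and_power(pi1, a=Fraction(0)):
    """Size (at Theta = 1) and power (at Theta = 2) of the Bayes rule for prior pi1."""
    def reject_prob_on_unit_interval(threshold):
        # Probability, under Theta = 1, of rejecting when 0 < x < 1.
        if pi1 < threshold:
            return Fraction(1)
        if pi1 == threshold:
            return Fraction(a)   # the "arbitrary" case
        return Fraction(0)

    r0 = reject_prob_on_unit_interval(Fraction(1, 3))  # Z = 0
    r1 = reject_prob_on_unit_interval(Fraction(1, 5))  # Z = 1
    half = Fraction(1, 2)
    size = half * r0 + half * r1
    # Under Theta = 2: P(X > 1 | Z = 0) = 1/2 and P(X > 1 | Z = 1) = 3/4; a region
    # in (0, 1) with Theta = 1 probability r has Theta = 2 probability r/2 or r/4.
    power = half * (Fraction(1, 2) + r0 / 2) + half * (Fraction(3, 4) + r1 / 4)
    return size, power

size, power = bayes_rule_size_and_power(Fraction(1, 3), Fraction(1, 10))
print(size, power)
```

With $\pi_1 = 1/3$ and $a = 0.1$ this reproduces the size 0.05 case discussed in the text, and the resulting power can be compared with the unconditional value $(9 + 4\alpha + \sqrt{1 + 8\alpha})/16$ at $\alpha = 0.05$.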

This example illustrates a conflict between two principles of classical 
statistics. The principle of conditioning on ancillaries together with the 
principle of choosing MP level a tests leads to the MP conditional size 
a test. This test is dominated by the MP unconditional size a test that 
ignores the ancillary. This test, in turn, is dominated by the unconditional 
size a test that makes use of the ancillary. The natural conclusion is to use 
this last test. But if we are to make use of the ancillary, aren't we supposed 
to condition on it? And if we condition on the ancillary, aren't we supposed 
to use the conditional size a test? The reason that it is difficult to justify 
(in the classical framework) using the size a test based on the whole data is 
that once the ancillary is observed, the conditional size of the test changes 
depending on the value of the ancillary. Why should an ancillary affect 
my choice of the size of the test? This begs the more important question, 
"How should the size of a test be chosen in a particular problem?" There 
are no general decision theoretic principles that lead one to be able to 
choose the size of a test based on a loss function or a prior distribution. 
There are cases in which one can find a simple correspondence between the
size of a test and a loss function, but these seem to depend on additional
structure not present in all problems, or they seem to be isolated instances 
not easily generalized. (See Theorem 6.74 on page 376 for a description of 
some additional structure and Example 4.61 on page 241 for an isolated 
example.) 



4.3.2 Simple Hypotheses, Composite Alternatives 

The next most complicated testing situation taken up by the classical theory
is that of a simple hypothesis versus a composite alternative.$^{13}$ It is
clear that, even from a decision theoretic perspective, a UMP level $\alpha$ test
will have no larger Bayes risk than any other test whose size is $\alpha$, no matter
what prior distribution we use, so long as $\Omega_H = \{\theta_0\}$ and $\Omega = \{\theta_0\} \cup \Omega_A$.
(See Problem 16 on page 287.)

Example 4.50. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Omega = [\theta_0, \infty)$
with $\Omega_H = \{\theta_0\}$. For each $\theta_1 \in \Omega_A$, the MP level $\alpha$ test is $\phi(x) = 1$ if
$f_{X|\Theta}(x|\theta_1)/f_{X|\Theta}(x|\theta_0) > k$. We can calculate the ratio

$$\frac{f_{X|\Theta}(x|\theta_1)}{f_{X|\Theta}(x|\theta_0)} = \exp\{x(\theta_1 - \theta_0)\} \exp\left\{-\frac{1}{2}(\theta_1^2 - \theta_0^2)\right\}.$$

This is greater than $k$ if and only if

$$x > t = \frac{\log k + \frac{1}{2}(\theta_1^2 - \theta_0^2)}{\theta_1 - \theta_0} = \frac{\theta_1 + \theta_0}{2} + \frac{\log k}{\theta_1 - \theta_0}.$$

(If we had a 0-1-c loss and the prior probability of $\theta_i$ were $\pi_i$, we would obtain
this same test as long as $k = c\pi_0/\pi_1$.) The size of the test $\phi_t(x) = 1$ if $x > t$ is
$\alpha_t = 1 - \Phi(t - \theta_0)$. So, $t = \theta_0 + \Phi^{-1}(1 - \alpha_t)$. For fixed $\alpha$, the MP level $\alpha$ test of
$H$ versus $A_1 : \Theta = \theta_1$ is $\phi_\alpha(x) = 1$ if $x > \theta_0 + \Phi^{-1}(1 - \alpha)$. Notice that this is the
same test for every $\theta_1$. Hence this test is UMP level $\alpha$ for testing $H$ versus $A$.
Also notice that the conditions of Lemma 4.38 are met in this example, so that
the UMP level $\alpha$ test is also the unique MP size $\alpha$ test for each $\theta_1 \in \Omega_A$ and
hence the unique UMP level $\alpha$ test.
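
The first part of Example 4.50 can be illustrated numerically. In the sketch below the helper names `ump_cutoff`, `np_constant`, and `power` are hypothetical. It shows that the cutoff $t$ depends only on $\theta_0$ and $\alpha$, while the Neyman-Pearson constant $k$ that produces this cutoff changes with $\theta_1$.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()

def ump_cutoff(theta0, alpha):
    """Cutoff t = theta0 + Phi^{-1}(1 - alpha) of the test that rejects when x > t."""
    return theta0 + STD_NORMAL.inv_cdf(1 - alpha)

def np_constant(theta0, theta1, t):
    """Value of k for which 'likelihood ratio > k' is the same event as 'x > t'."""
    return exp(t * (theta1 - theta0) - (theta1**2 - theta0**2) / 2)

def power(theta, theta0, alpha):
    """Power function of the size alpha test at the parameter value theta."""
    return 1 - STD_NORMAL.cdf(ump_cutoff(theta0, alpha) - theta)

theta0, alpha = 0.0, 0.05
t = ump_cutoff(theta0, alpha)
for theta1 in (0.5, 1.0, 2.0):
    k = np_constant(theta0, theta1, t)
    print(f"theta1 = {theta1}: cutoff t = {t:.4f}, k = {k:.4f}, "
          f"power = {power(theta1, theta0, alpha):.4f}")
```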

Now, consider a Bayesian approach in which the prior distribution satisfies
$\Pr(\Theta = \theta_0) = p_0 > 0$. Suppose that the conditional prior distribution of $\Theta$
given $\Theta > \theta_0$ is a measure $\lambda$. It is not difficult to calculate the Bayes factor (see
Section 4.2.2) in this case as

$$\frac{(1 - p_0) \Pr(\Theta = \theta_0 | X = x)}{p_0 \Pr(\Theta \ne \theta_0 | X = x)} = \exp\left(-\frac{\theta_0^2}{2}\right) \left[ \int_{(\theta_0, \infty)} \exp\left\{x[\theta - \theta_0] - \frac{\theta^2}{2}\right\} d\lambda(\theta) \right]^{-1}.$$

Since $\theta > \theta_0$ in the exponent inside the integral, the Bayes factor is a decreasing
function of $x$ no matter what $\lambda$ is. Hence, the formal Bayes rule (with 0-1-c
loss) will be to reject $H$ if $x > t$ is observed, where $t$ is the largest $x$ so that
$\Pr(\Theta = \theta_0 | X = x) \ge 1/(1 + c)$. In this case the prior distribution determines $t$,
which in turn determines the size of the Bayes rule when thought of as a size
$\alpha$ test. Each Bayes rule is a UMP level $\alpha$ test for some $\alpha$, but the particular $\alpha$
(even for fixed $c$) depends on the prior. Alternatively, for a fixed prior, each value
of $c$ will lead to a different $\alpha$, but the correspondence between $\alpha$ and $c$ (although
monotone for each prior) will be different for different priors.

13 By dealing with this case next, we postpone the issue that arises when the
size of the test is the supremum of the power function over the hypothesis rather
than just the value of the power function at the hypothesis. This subtle point is
actually at the root of the asymmetry between hypotheses and alternatives.

The dual concept of UMC test would apply if the hypothesis were composite
and the alternative were simple. It would then be the case that, if
we switched the names of hypothesis and alternative, 1 minus the UMP
level $\alpha$ test would become the UMC floor $1 - \alpha$ test. (See Problem 30
on page 289.) This case is interesting in that it is never discussed in the
hypothesis-testing literature because the classical theory is not equipped
to deal with it. If the power function is continuous (as it is in most of
the examples that we can calculate) the power of an MP level $\alpha$ test of
$H : \Theta < \theta_0$ versus $A : \Theta = \theta_0$ will be $\alpha$. The optimality criterion gives
us no way to choose between level $\alpha$ tests. They all have the same power
on the alternative. Clearly, though, some are better than others. In fact,
those that have smaller power functions for $\theta < \theta_0$ are better. But the MP
criterion does not accommodate such comparisons.



4.3.3 One-Sided Tests 

Next we consider the case in which both H and A are composite. For 
now, we will proceed along the classical UMP level a lines. We begin by 
introducing a concept that makes UMP and UMC tests come out the same. 

Definition 4.51. If $\Omega \subseteq \mathbb{R}$, $\mathcal{X} \subseteq \mathbb{R}$, and $dP_\theta/d\nu(x) = f_{X|\Theta}(x|\theta)$ for some
measure $\nu$, then the parametric family is said to have monotone likelihood
ratio (MLR) if, whenever $\theta_1 < \theta_2$, $f_{X|\Theta}(x|\theta_2)/f_{X|\Theta}(x|\theta_1)$ is a monotone
function of $x$ a.e. $[P_{\theta_1} + P_{\theta_2}]$ in the same direction for all pairs of $\theta_1$ and
$\theta_2$. A parametric family has increasing MLR if the ratio is increasing for
all $\theta_1 < \theta_2$, and it has decreasing MLR if the ratio is decreasing.

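Example 4.52. Suppose that $X$ has a one-parameter exponential family distribution
with $\theta$ being the natural parameter, $f_{X|\Theta}(x|\theta) = c(\theta) h(x) \exp(\theta x)$. Then

$$\frac{f_{X|\Theta}(x|\theta_2)}{f_{X|\Theta}(x|\theta_1)} = \frac{c(\theta_2)}{c(\theta_1)} \exp\{x(\theta_2 - \theta_1)\},$$

which is increasing in $x$ for all $\theta_1 < \theta_2$.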

Example 4.53. Suppose that $f_{X|\Theta}(x|\theta) = (\pi[1 + (x - \theta)^2])^{-1}$, the Cauchy
distribution with location shifts. Then

$$\frac{f_{X|\Theta}(x|\theta_2)}{f_{X|\Theta}(x|\theta_1)} = \frac{1 + (x - \theta_1)^2}{1 + (x - \theta_2)^2}.$$

This ratio goes to 1 as $x$ approaches either $\infty$ or $-\infty$, but it is not constantly 1
and hence is not monotone. The same problem occurs with Cauchy distributions
having only a scale parameter. However, the absolute value of a Cauchy random
variable with a scale parameter does have MLR. (See Problem 17 on page 287.)
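
A quick numerical look at the Cauchy location ratio of Example 4.53 makes the failure of monotonicity concrete. The grid and the choice $\theta_1 = 0$, $\theta_2 = 1$ are arbitrary assumptions for this sketch.

```python
def cauchy_ratio(x, theta1, theta2):
    """Likelihood ratio f(x | theta2) / f(x | theta1) for Cauchy location shifts."""
    return (1 + (x - theta1) ** 2) / (1 + (x - theta2) ** 2)

# Evaluate the ratio on a grid and look at the sign of successive differences.
grid = [-50 + 0.5 * i for i in range(201)]
values = [cauchy_ratio(x, 0.0, 1.0) for x in grid]
differences = [later - earlier for earlier, later in zip(values, values[1:])]

has_increase = any(d > 0 for d in differences)
has_decrease = any(d < 0 for d in differences)
print(has_increase, has_decrease)
print(values[0], values[-1])  # both are close to 1 in the tails
```

Because the ratio both rises and falls over the grid, it cannot be a monotone function of $x$.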

Example 4.54. Suppose that $f_{X|\Theta}(x|\theta) = 1/\theta$ for $0 < x < \theta$. For $\theta_2 > \theta_1$, write
the likelihood ratio as

$$\frac{f_{X|\Theta}(x|\theta_2)}{f_{X|\Theta}(x|\theta_1)} = \begin{cases} \text{undefined} & \text{if } x \le 0, \\ \theta_1/\theta_2 & \text{if } 0 < x < \theta_1, \\ \infty & \text{if } \theta_1 \le x < \theta_2, \\ \text{undefined} & \text{if } x \ge \theta_2. \end{cases}$$

The two "undefined" regions can be ignored because the ratio need only be
monotone in $x$ a.e. $[P_{\theta_1} + P_{\theta_2}]$. The "undefined" regions have 0 probability under
both $P_{\theta_1}$ and $P_{\theta_2}$. The likelihood ratio is then seen to be monotone increasing for
every $\theta_1 < \theta_2$, although it is not strictly increasing. It only takes on two different
values.

Proposition 4.55. If $g$ is measurable, monotonic, and one-to-one and $Y = g(X)$
and the family of distributions of $X$ has MLR, then so does the family
of distributions of $Y$.

Theorem 4.56. If $\{P_\theta : \theta \in \Omega\}$ is a parametric family with increasing
MLR, then every test of the form

$$\phi(x) = \begin{cases} 1 & \text{if } x > x_0, \\ \gamma & \text{if } x = x_0, \\ 0 & \text{if } x < x_0 \end{cases} \qquad (4.57)$$

has nondecreasing (in $\theta$) power function. Furthermore, each such test is
UMP of its size for testing $H : \Theta \le \theta_0$ versus $A : \Theta > \theta_0$, no matter what $\theta_0$
is. Finally, for each $\alpha \in [0, 1]$ and each $\theta_0 \in \Omega$, there exists $x_0 \in \mathbb{R} \cup \{\pm\infty\}$
and $\gamma \in [0, 1]$ such that the test $\phi$ is UMP level $\alpha$ for testing $H$ versus $A$.

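Proof. Let $\theta_1 < \theta_2 \in \Omega$. The Neyman-Pearson fundamental lemma 4.37
says that the MP test of $H_1 : \Theta = \theta_1$ versus $A_2 : \Theta = \theta_2$ is

$$\phi(x) = \begin{cases} 1 & \text{if } f_{X|\Theta}(x|\theta_2)/f_{X|\Theta}(x|\theta_1) > k, \\ \gamma(x) & \text{if } f_{X|\Theta}(x|\theta_2)/f_{X|\Theta}(x|\theta_1) = k, \\ 0 & \text{if } f_{X|\Theta}(x|\theta_2)/f_{X|\Theta}(x|\theta_1) < k. \end{cases}$$

Because the parametric family has MLR increasing, we can write $\phi$ as

$$\phi(x) = \begin{cases} 1 & \text{if } x > \overline{t}, \\ \gamma(x) & \text{if } \underline{t} \le x \le \overline{t}, \\ 0 & \text{if } x < \underline{t}. \end{cases} \qquad (4.58)$$

(Note that we have put $x = \underline{t}$ and $x = \overline{t}$ in the $\phi(x) = \gamma(x)$ category.
It may be required to set $\gamma(\underline{t}) = 0$ and/or $\gamma(\overline{t}) = 1$, depending on the
values of $f_{X|\Theta}(\underline{t}|\theta_2)/f_{X|\Theta}(\underline{t}|\theta_1)$ and $f_{X|\Theta}(\overline{t}|\theta_2)/f_{X|\Theta}(\overline{t}|\theta_1)$ if $X$ does not
have continuous distribution.) Let $\phi$ be of the form (4.58) and set $\alpha' = \beta_\phi(\theta_1)$.
Let $\phi_{\alpha'}(x) \equiv \alpha'$. This test has level $\alpha'$ and power $\alpha'$ at $\theta_2$, and since $\phi$ is MP,
it follows that $\beta_\phi(\theta_2) \ge \alpha'$.
Hence, we have shown that $\phi$ has nondecreasing power function.

Now, let $\alpha \in [0, 1]$ be given, and let

$$x_0 = \begin{cases} \inf\{x : P_{\theta_0}(-\infty, x] \ge 1 - \alpha\} & \text{if } \alpha < 1, \\ \inf\{x : P_{\theta_0}(-\infty, x] > 0\} & \text{if } \alpha = 1. \end{cases}$$

Then $\alpha^* = P_{\theta_0}(x_0, \infty) \le \alpha$ and $P_{\theta_0}(\{x_0\}) \ge \alpha - \alpha^*$. (See Figure 4.59.) Let

$$\gamma^* = \begin{cases} 0 & \text{if } P_{\theta_0}(\{x_0\}) = 0, \\ \dfrac{\alpha - \alpha^*}{P_{\theta_0}(\{x_0\})} & \text{if } P_{\theta_0}(\{x_0\}) > 0. \end{cases}$$

Figure 4.59. Step in Proof of Theorem 4.56 (plot of the CDF of $X$ against $x$, marking the jump $\alpha - \alpha^*$)

(Note that $x_0 = -\infty$ is possible for $\alpha = 1$ and an unbounded distribution.
In this case, $P_{\theta_0}(\{x_0\}) = 0$, $\alpha^* = 1$, and $\gamma^* = 0$.) Let $\phi$ be of the form
(4.58) with $\underline{t} = x_0 = \overline{t}$ and $\gamma(x_0) = \gamma^*$. Then $\beta_\phi(\theta_0) = \alpha$ and $\phi$ is MP level
$\alpha$ for testing $H_0 : \Theta = \theta_0$ versus $A_1 : \Theta = \theta_1$ for every $\theta_1 > \theta_0$, since it is
the same test for all $\theta_1$. Hence, $\phi$ is UMP level $\alpha$ for testing $H_0$ versus $A$.

Finally, let $\psi$ be any test having level $\alpha$ for $H$. Then $\psi$ has level at most
$\alpha$ for $H_0$, and by Lemma 4.48, $\phi$ is at least as powerful as $\psi$ at all $\theta \in \Omega_A$.
Since $\beta_\phi(\theta)$ is increasing, $\phi$ has level $\alpha$ for $H$, so it is UMP level $\alpha$ for
testing $H$ versus $A$. □

Definition 4.60. A test such as $\phi$ in Theorem 4.56 is called a one-sided
test. A hypothesis of the form $H : \Theta \le \theta_0$ or $H : \Theta \ge \theta_0$ is called a
one-sided hypothesis.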



Example 4.61. Suppose that $X \sim Poi(\theta)$ given $\Theta = \theta$. Then $f_{X|\Theta}(x|\theta) =
\exp(-\theta)\theta^x/x!$ for $x = 0, 1, \ldots$ is the conditional density of $X$ with respect to
counting measure. For $\theta_1 < \theta_2$,

$$\frac{f_{X|\Theta}(x|\theta_2)}{f_{X|\Theta}(x|\theta_1)} = \left(\frac{\theta_2}{\theta_1}\right)^x \exp(\theta_1 - \theta_2),$$

which increases in $x$. Hence, this family has MLR increasing. The UMP level 0.05
test of $H : \Theta \le 1$ versus $A : \Theta > 1$ is

$$\phi(x) = \begin{cases} 1 & \text{if } x > v, \\ \gamma & \text{if } x = v, \\ 0 & \text{if } x < v, \end{cases}$$

where $v$ and $\gamma$ are chosen so that

$$0.05 = \gamma \frac{e^{-1}}{v!} + \sum_{x = v + 1}^{\infty} \frac{e^{-1}}{x!}.$$

A little algebra shows us that $v = 3$ and $\gamma = 0.506$.

As a possible Bayesian solution to this problem, suppose that we have a 0-1-c
loss and the prior for $\Theta$ is $\Gamma(a, b)$. Then the posterior given $X = x$ is $\Gamma(a + x, b + 1)$.
Since Gamma distributions are stochastically larger the larger the first parameter
is (for fixed second parameter), we conclude that $\Pr(\Theta \le 1 | X = x) < 1/(1 + c)$
if and only if $x \ge x_0$ for some $x_0$. For the improper prior with $a = b = 0$
(corresponding to density $d\theta/\theta$) and with $c = 19$ (so $1/(1 + c) = 0.05$), the value
of $x_0$ is 4. This is the same as the UMP level 0.05 test except for the randomization
at $x = 3$.
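
Both halves of Example 4.61 can be reproduced with a few lines of code. The function names below are hypothetical. The Bayesian part uses the standard identity that, for integer shape $x \ge 1$, the $\Gamma(x, 1)$ distribution function at 1 equals the probability that a Poisson variable with mean 1 is at least $x$.

```python
from math import exp, factorial

def poisson_pmf(x, theta):
    """Poisson probability mass function at the integer x."""
    return exp(-theta) * theta**x / factorial(x)

def poisson_upper_tail(x, theta):
    """P(X >= x) for X ~ Poi(theta)."""
    return 1.0 - sum(poisson_pmf(j, theta) for j in range(x))

def ump_one_sided(theta0, alpha):
    """Cutoff v and randomization probability gamma of the one-sided test
    that rejects when x > v and rejects with probability gamma when x == v."""
    v = 0
    while poisson_upper_tail(v + 1, theta0) > alpha:
        v += 1
    gamma = (alpha - poisson_upper_tail(v + 1, theta0)) / poisson_pmf(v, theta0)
    return v, gamma

v, gamma = ump_one_sided(1.0, 0.05)
print(v, round(gamma, 3))

def posterior_prob_theta_at_most_1(x):
    """Pr(Theta <= 1 | X = x) under the improper prior d(theta)/theta,
    where the posterior is Gamma(x, 1) for an integer x >= 1."""
    return poisson_upper_tail(x, 1.0)

# Smallest x whose posterior probability of the hypothesis falls below 1/(1 + c) = 0.05.
x0 = next(x for x in range(1, 50) if posterior_prob_theta_at_most_1(x) < 0.05)
print(x0)
```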

There are also versions of Theorem 4.56 for decreasing MLR and for
hypotheses of the form $H : \Theta \ge \theta_0$.

Proposition 4.62. If $\{P_\theta : \theta \in \Omega\}$ is a parametric family with decreasing
MLR, then any test of the form

$$\phi(x) = \begin{cases} 1 & \text{if } x < x_0, \\ \gamma & \text{if } x = x_0, \\ 0 & \text{if } x > x_0 \end{cases} \qquad (4.63)$$

has a nondecreasing (in $\theta$) power function. Furthermore, it is UMP of its
size for testing $H : \Theta \le \theta_0$ versus $A : \Theta > \theta_0$, no matter what $\theta_0$ is.
Finally, for each $\alpha \in [0, 1]$ and each $\theta_0 \in \Omega$, there exists $x_0 \in \mathbb{R} \cup \{\pm\infty\}$
and $\gamma \in [0, 1]$ such that the test $\phi$ is UMP level $\alpha$ for testing $H$ versus $A$.

Proposition 4.64. If $\{P_\theta : \theta \in \Omega\}$ is a parametric family with increasing
MLR, then any test of the form (4.63) has a nonincreasing (in $\theta$) power
function. Furthermore, it is UMP of its size for testing $H : \Theta \ge \theta_0$ versus
$A : \Theta < \theta_0$, no matter what $\theta_0$ is. Finally, for each $\alpha \in [0, 1]$ and each
$\theta_0 \in \Omega$, there exists $x_0 \in \mathbb{R} \cup \{\pm\infty\}$ and $\gamma \in [0, 1]$ such that the test $\phi$ is
UMP level $\alpha$ for testing $H$ versus $A$.

Proposition 4.65. If $\{P_\theta : \theta \in \Omega\}$ is a parametric family with decreasing
MLR, then any test of the form (4.57) has a nonincreasing (in $\theta$) power
function. Furthermore, it is UMP of its size for testing $H : \Theta \ge \theta_0$ versus
$A : \Theta < \theta_0$, no matter what $\theta_0$ is. Finally, for each $\alpha \in [0, 1]$ and each
$\theta_0 \in \Omega$, there exists $x_0 \in \mathbb{R} \cup \{\pm\infty\}$ and $\gamma \in [0, 1]$ such that the test $\phi$ is
UMP level $\alpha$ for testing $H$ versus $A$.

The proofs of these propositions are all very similar to the proof of The- 
orem 4.56. The tests in these propositions are also called one-sided tests. 

The following simple result shows that one-sided tests also minimize the 
power function on the hypothesis among tests with the same size. 

Corollary 4.66.$^{14}$ Let the family of distributions have MLR, and suppose
that the hypothesis is either $H : \Theta \le \theta_0$ or $H : \Theta \ge \theta_0$. A one-sided UMP
level $\alpha$ test $\phi$ satisfies $\beta_\phi(\theta) \le \beta_\psi(\theta)$ for all $\theta \in \Omega_H$ and all tests $\psi$ such
that $\beta_\psi(\theta_0) \ge \alpha$.

Proof. Let $\psi$ satisfy $\beta_\psi(\theta_0) \ge \alpha$, and let $\theta \in \Omega_H$. As in the proof of
Theorem 4.56, we can show that a MP level $1 - \alpha$ test of $H' : \Theta = \theta_0$
versus $A' : \Theta = \theta$ is any one-sided test with size $1 - \alpha$. Since $1 - \phi$ is
a one-sided test with size $1 - \alpha$ and $1 - \psi$ has level $1 - \alpha$, it follows that
$\beta_{1-\phi}(\theta) \ge \beta_{1-\psi}(\theta)$, hence $\beta_\phi(\theta) \le \beta_\psi(\theta)$. □

The following result is a simple consequence of Lemma 4.38, and it gives
conditions under which UMP tests are essentially unique. It will also be
useful in showing that there are situations in which there are no UMP tests
of $H : \Theta = \theta_0$ versus $A : \Theta \ne \theta_0$.

Proposition 4.67. Suppose that $\{P_\theta : \theta \in \Omega\}$ is a parametric family with
$P_\theta \ll \nu$ for all $\theta$ and with increasing MLR. Let $f_{X|\Theta}(\cdot|\theta)$ denote $dP_\theta/d\nu$.
Let $\theta_0 \in \Omega$, and define

$$B_{\theta,k} = \{x : f_{X|\Theta}(x|\theta) = k f_{X|\Theta}(x|\theta_0)\}.$$

Suppose that for all $\theta > \theta_0$ and all $k \in [0, \infty]$, $P_\theta(B_{\theta,k}) = 0$. Let $\phi$ be a test
of $H : \Theta = \theta_0$ versus $A : \Theta > \theta_0$ of the form (4.57), and let $\psi$ be another
test such that $\beta_\psi(\theta_0) = \beta_\phi(\theta_0)$. Then either $\phi = \psi$ a.s. $[P_\theta]$ for all $\theta \ge \theta_0$,
or there exists $\theta > \theta_0$ such that $\beta_\phi(\theta) > \beta_\psi(\theta)$.

There are versions of Proposition 4.67 for decreasing MLR and for alternatives
of the form $A : \Theta < \theta_0$, but we will not state them here. An
example of one of these propositions is the first part of Example 4.50 on
page 238.

The following is a complete class theorem for the case of MLR. 



14 This corollary is used in the proof of Theorem 4.68. 






Theorem 4.68. Let the action space be $\aleph = \{0, 1\}$. Suppose that a parametric
family has MLR and that there exists $\theta_0 \in \Omega$ such that

$$[L(\theta, 1) - L(\theta, 0)](\theta_0 - \theta) \qquad (4.69)$$

has the same sign for all $\theta \ne \theta_0$. Then the appropriate family of one-sided
tests is an essentially complete class.

Proof. Consider the case in which (4.69) is always positive and the family
has increasing MLR. Let the hypothesis be $H : \Theta \le \theta_0$. Let $\phi$ be any test.
Then

$$R(\theta, \phi) = \int \{\phi(x) L(\theta, 1) + [1 - \phi(x)] L(\theta, 0)\} f_{X|\Theta}(x|\theta)\, d\nu(x)$$
$$= L(\theta, 0) + [L(\theta, 1) - L(\theta, 0)] \int \phi(x) f_{X|\Theta}(x|\theta)\, d\nu(x).$$

So, if $\phi$ and $\psi$ are any two tests, then

$$R(\theta, \psi) - R(\theta, \phi) = [L(\theta, 1) - L(\theta, 0)] \int [\psi(x) - \phi(x)] f_{X|\Theta}(x|\theta)\, d\nu(x)$$
$$= [L(\theta, 1) - L(\theta, 0)][\beta_\psi(\theta) - \beta_\phi(\theta)].$$

Now let $\psi$ be an arbitrary test. We must show that there is a one-sided test
$\phi$ which is at least as good as $\psi$. Let $\phi$ be a one-sided test with $\beta_\phi(\theta_0) =
\beta_\psi(\theta_0)$. Then Theorem 4.56 and Corollary 4.66 give us that $\beta_\psi(\theta) - \beta_\phi(\theta)$
has the same sign as $\theta_0 - \theta$. This implies that $R(\theta, \psi) \ge R(\theta, \phi)$ for all $\theta$.

For the other cases (decreasing MLR and/or (4.69) always negative),
use one of Propositions 4.62-4.65 together with Corollary 4.66 to obtain a
similar result. □

We can use Theorem 4.68 to help prove that in an MLR family with 
a one-sided hypothesis, one-sided tests are UMC as well as UMP. (See 
Problems 23 and 24 on page 288.) 

Proposition 4.70.$^{15}$ Let $\phi$ be a one-sided test as in Theorem 4.56 or in
Propositions 4.62-4.65 for a one-sided hypothesis versus the corresponding
one-sided alternative in an MLR family. Suppose that the base of $\phi$ is $\gamma$.
Then $\phi$ is UMC floor $\gamma$.

Suppose that we have a prior distribution $\mu_\Theta$ on the parameter space.
If there is a test with finite Bayes risk and the loss function is bounded
below, then Theorem 4.68 allows us to conclude that one-sided tests are
formal Bayes rules. (See Problem 33 on page 289.) For the case of 0-1-c



15 This is not precisely the same as Corollary 4.66. Corollary 4.66 says that the
power function is minimized on the hypothesis subject to the power function at
$\theta_0$ being at least $\alpha$. If a power function is monotone but not continuous, the base
of the test might be different from its size.






loss, the result follows in a simpler fashion from the fact that the posterior 
probability of a semi-infinite interval is a monotone function of the data in 
MLR families. 

Theorem 4.71. Suppose that $\{P_\theta : \theta \in \Omega\}$ is a parametric family of
distributions for $X$ with MLR and that $\mu_\Theta$ is an arbitrary prior distribution
on $(\Omega, \tau)$. Then the posterior probability given $X = x$ that $\Theta$ is in a
semi-infinite interval is a monotone function of $x$.

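PROOF. We will prove that if the family has increasing MLR then the
posterior probability that $\Theta \ge \theta_0$ is a nondecreasing function of $x$. The
other cases are all similar. Let $x_1 < x_2$. Then

$$
\begin{aligned}
&\frac{\Pr(\Theta \ge \theta_0 | X = x_2)}{\Pr(\Theta < \theta_0 | X = x_2)}
 - \frac{\Pr(\Theta \ge \theta_0 | X = x_1)}{\Pr(\Theta < \theta_0 | X = x_1)} \\
&\quad = \frac{\int_{[\theta_0,\infty)} f_{X|\Theta}(x_2|\theta)\,d\mu_\Theta(\theta)}{\int_{(-\infty,\theta_0)} f_{X|\Theta}(x_2|\theta)\,d\mu_\Theta(\theta)}
 - \frac{\int_{[\theta_0,\infty)} f_{X|\Theta}(x_1|\theta)\,d\mu_\Theta(\theta)}{\int_{(-\infty,\theta_0)} f_{X|\Theta}(x_1|\theta)\,d\mu_\Theta(\theta)} \\
&\quad = \frac{\int_{[\theta_0,\infty)} \int_{(-\infty,\theta_0)} \left[ f_{X|\Theta}(x_2|\theta_2) f_{X|\Theta}(x_1|\theta_1) - f_{X|\Theta}(x_2|\theta_1) f_{X|\Theta}(x_1|\theta_2) \right] d\mu_\Theta(\theta_1)\,d\mu_\Theta(\theta_2)}{\int_{(-\infty,\theta_0)} f_{X|\Theta}(x_2|\theta)\,d\mu_\Theta(\theta) \int_{(-\infty,\theta_0)} f_{X|\Theta}(x_1|\theta)\,d\mu_\Theta(\theta)}.
\end{aligned}
\tag{4.72}
$$

Since the family of distributions has increasing MLR, it follows that

$$f_{X|\Theta}(x_2|\theta_2) f_{X|\Theta}(x_1|\theta_1) - f_{X|\Theta}(x_2|\theta_1) f_{X|\Theta}(x_1|\theta_2) \ge 0,$$

for all $x_1 < x_2$ and all $\theta_1 < \theta_2$. This makes the last expression in the last
line of (4.72) nonnegative, and the result is proven. □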
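The monotonicity asserted in Theorem 4.71 can be illustrated numerically. The sketch below is not from the text: it assumes a binomial family (which has increasing MLR in $x$) and an arbitrary five-point prior chosen only for illustration, and it checks that the posterior probability of $\{\Theta \ge \theta_0\}$ does not decrease as $x$ grows.

```python
from math import comb

def posterior_upper_prob(x, n, thetas, prior, theta0):
    """Posterior P(Theta >= theta0 | X = x) for X ~ Bin(n, theta), discrete prior."""
    weights = [p * comb(n, x) * t**x * (1 - t)**(n - x)
               for t, p in zip(thetas, prior)]
    total = sum(weights)
    return sum(w for t, w in zip(thetas, weights) if t >= theta0) / total

# Hypothetical prior: any prior on (0, 1) would serve for the illustration.
n = 10
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.4, 0.1, 0.1, 0.1, 0.3]

probs = [posterior_upper_prob(x, n, thetas, prior, 0.5) for x in range(n + 1)]
# Allow a tiny tolerance for floating-point rounding near 0 and 1.
is_nondecreasing = all(a <= b + 1e-12 for a, b in zip(probs, probs[1:]))
print(is_nondecreasing)
```

The prior is deliberately lopsided; the theorem says the shape of the prior does not matter for monotonicity.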

Corollary 4.73. Suppose that $\{P_\theta : \theta \in \Omega\}$ is a parametric family of
distributions for $X$ with MLR and that $\mu_\Theta$ is an arbitrary prior distribution
on $(\Omega, \tau)$. Suppose that we are testing a one-sided hypothesis against the
corresponding one-sided alternative with a 0-1-c loss. Then one-sided tests
are formal Bayes rules.

When no UMP level $\alpha$ test exists, one option that has been suggested is
to find a locally most powerful test.

Definition 4.74. Let $d$ be a strictly positive function on $\Omega_A$. We say that
a level $\alpha$ test $\phi$ is locally most powerful level $\alpha$ relative to $d$ (LMP) if, for
every level $\alpha$ test $\psi$, there exists $c$ such that $\beta_\phi(\theta) \ge \beta_\psi(\theta)$ for all $\theta$ such
that $0 < d(\theta) < c$.






For a one-dimensional parameter $\theta$ and $H : \Theta \le \theta_0$, if power functions
are continuously differentiable and $\phi$ is the unique level $\alpha$ test with maximum
derivative for the power function at $\theta_0$, then $\phi$ is LMP level $\alpha$ relative
to $d(\theta) = \theta - \theta_0$ for $\theta > \theta_0$. (See Problem 34 on page 289.)

4.3.4 Two-Sided Hypotheses 

Two-sided testing situations come in two forms. 

Definition 4.75. If $H : \theta_1 \le \Theta \le \theta_2$ and $A : \Theta > \theta_2$ or $\Theta < \theta_1$, then the
alternative is two-sided. If $H : \Theta \ge \theta_2$ or $\Theta \le \theta_1$ and $A : \theta_1 < \Theta < \theta_2$,
then the hypothesis is two-sided.

The only difference, from a decision theoretic viewpoint, between two- 
sided alternatives and two-sided hypotheses is where the endpoints go. The 
tradition in classical hypothesis testing is to put the endpoints into the hy- 
pothesis. That is, the hypothesis is closed and the alternative is open. There 
is no need to require this, especially in cases in which the power functions 
are continuous. However, the treatments of these two cases are drastically 
different in the classical framework. The difference has nothing to do with 
where the endpoints go, but rather with the asymmetric treatment of hy- 
potheses and alternatives in the optimality criteria. 

In this section, we consider the case of two-sided hypotheses. We put 
off the case of two-sided alternatives until Section 4.4. Some mathematical 
lemmas are needed first. 

Lemma 4.76 (Lagrange multipliers).$^{16}$ Let $T$ be any set, let $f : T \to \mathbb{R}$
and $g_i : T \to \mathbb{R}$, for $i = 1, \ldots, n$ be functions, and let $\lambda_1, \ldots, \lambda_n$ be real
numbers. If $t_0$ minimizes $f(t) + \sum_{i=1}^n \lambda_i g_i(t)$ and satisfies $g_i(t_0) = c_i$ for
$i = 1, \ldots, n$, then $t_0$ minimizes $f(t)$ subject to $g_i(t) \le c_i$ for each $\lambda_i > 0$
and $g_i(t) \ge c_i$ for each $\lambda_i < 0$.

Proof. Suppose, to the contrary, that there exists $t_1$ such that $f(t_1) <
f(t_0)$, while $g_i(t_1) \le c_i$ for each $\lambda_i > 0$ and $g_i(t_1) \ge c_i$ for each $\lambda_i < 0$.
Then

$$f(t_1) + \sum_{i=1}^n \lambda_i g_i(t_1) < f(t_0) + \sum_{i=1}^n \lambda_i c_i = f(t_0) + \sum_{i=1}^n \lambda_i g_i(t_0).$$

This contradicts the assumption that $t_0$ minimizes $f(t) + \sum_{i=1}^n \lambda_i g_i(t)$. □

Corollary 4.77.$^{17}$ Let $T$ be any set, let $f : T \to \mathbb{R}$ and $g_i : T \to \mathbb{R}$,
for $i = 1, \ldots, n$ be functions, and let $\lambda_1, \ldots, \lambda_n$ be real numbers. If $t_0$
maximizes $f(t) + \sum_{i=1}^n \lambda_i g_i(t)$ and satisfies $g_i(t_0) = c_i$ for $i = 1, \ldots, n$,
then $t_0$ maximizes $f(t)$ subject to $g_i(t) \ge c_i$ for each $\lambda_i > 0$ and $g_i(t) \le c_i$
for each $\lambda_i < 0$.

16 This lemma is used in the proof of Lemma 4.78.
17 This corollary is used in the proof of Lemma 4.78.

The following lemma is sometimes called the generalized Neyman-Pearson 
lemma due to the resemblance it bears to Proposition 4.37. 

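Lemma 4.78.$^{18}$ Let $p_0, p_1, \ldots, p_n$ be integrable (and not a.e. 0) functions
(with respect to a measure $\nu$), and let

$$
\phi_0(x) = \begin{cases}
1 & \text{if } p_0(x) > \sum_{i=1}^n k_i p_i(x), \\
\gamma(x) & \text{if } p_0(x) = \sum_{i=1}^n k_i p_i(x), \\
0 & \text{if } p_0(x) < \sum_{i=1}^n k_i p_i(x),
\end{cases}
$$

where $0 \le \gamma(x) \le 1$ and the $k_i$ are constants. Then $\phi_0$ minimizes
$\int [1 - \phi(x)] p_0(x)\,d\nu(x)$ subject to

$$
\begin{aligned}
\int \phi(x) p_j(x)\,d\nu(x) &\le \int \phi_0(x) p_j(x)\,d\nu(x), \quad \text{for those } j \text{ such that } k_j > 0, \\
\int \phi(x) p_j(x)\,d\nu(x) &\ge \int \phi_0(x) p_j(x)\,d\nu(x), \quad \text{for those } j \text{ such that } k_j < 0.
\end{aligned}
$$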

Proof. Let $\phi$ be an arbitrary measurable function taking values between
0 and 1 which satisfies the preceding inequality constraints. Since $\phi(x) \le
\phi_0(x)$ whenever $p_0(x) - \sum_{i=1}^n k_i p_i(x) > 0$ and $\phi(x) \ge \phi_0(x)$ whenever
$p_0(x) - \sum_{i=1}^n k_i p_i(x) < 0$, it is clear that

$$\int [\phi(x) - \phi_0(x)] \left[ p_0(x) - \sum_{i=1}^n k_i p_i(x) \right] d\nu(x) \le 0.$$

It follows from this that

$$
\begin{aligned}
&\int [1 - \phi_0(x)] p_0(x)\,d\nu(x) + \sum_{i=1}^n k_i \int \phi_0(x) p_i(x)\,d\nu(x) \\
&\quad \le \int [1 - \phi(x)] p_0(x)\,d\nu(x) + \sum_{i=1}^n k_i \int \phi(x) p_i(x)\,d\nu(x).
\end{aligned}
\tag{4.79}
$$

Now, let $T$ be the set of all measurable functions from $\mathcal{X}$ to $[0, 1]$, and let

$$
\begin{aligned}
f(t) &= \int [1 - t(x)] p_0(x)\,d\nu(x), \\
g_i(t) &= \int t(x) p_i(x)\,d\nu(x), \quad \text{for } i = 1, \ldots, n.
\end{aligned}
$$

Equation (4.79) says that $\phi_0$ minimizes $f(\phi) + \sum_{i=1}^n k_i g_i(\phi)$. Lemma 4.76
implies that $\phi_0$ minimizes $f(\phi)$ subject to the constraints. □
There is also a version of this lemma corresponding to Corollary 4.77. 

18 This lemma is used in the proofs of Theorems 4.82 and 4.104. 






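Corollary 4.80.$^{19}$ Let $p_0, p_1, \ldots, p_n$ be integrable (and not a.e. 0) functions
(with respect to a measure $\nu$), and let

$$
\phi_0(x) = \begin{cases}
0 & \text{if } p_0(x) > \sum_{i=1}^n k_i p_i(x), \\
\gamma(x) & \text{if } p_0(x) = \sum_{i=1}^n k_i p_i(x), \\
1 & \text{if } p_0(x) < \sum_{i=1}^n k_i p_i(x),
\end{cases}
$$

where $0 \le \gamma(x) \le 1$ and the $k_i$ are constants. Then $\phi_0$ maximizes
$\int [1 - \phi(x)] p_0(x)\,d\nu(x)$ subject to

$$
\begin{aligned}
\int \phi(x) p_j(x)\,d\nu(x) &\ge \int \phi_0(x) p_j(x)\,d\nu(x), \quad \text{for those } j \text{ such that } k_j > 0, \\
\int \phi(x) p_j(x)\,d\nu(x) &\le \int \phi_0(x) p_j(x)\,d\nu(x), \quad \text{for those } j \text{ such that } k_j < 0.
\end{aligned}
$$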

The theorems we can prove assume that we are dealing with a one- 
parameter exponential family. 

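Lemma 4.81.$^{20}$ Assume that the parametric family has monotone likelihood
ratio. If $\phi$ is an arbitrary test and $\theta_1 < \theta_2$, define $\alpha_i = \beta_\phi(\theta_i)$ for
$i = 1, 2$. Then there is a test of the form

$$
\psi(x) = \begin{cases}
1 & \text{if } c_1 < x < c_2, \\
\gamma_i & \text{if } x = c_i, \\
0 & \text{if } c_1 > x \text{ or } c_2 < x,
\end{cases}
$$

with $c_1 \le c_2$ such that $\beta_\psi(\theta_i) = \alpha_i$ for $i = 1, 2$.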

Proof. Define $\phi_w$ to be the UMP level $w$ test of $H : \Theta \le \theta_1$ versus
$A : \Theta > \theta_1$, and for each $0 \le u \le 1 - \alpha_1$, set

$$\phi'_u(x) = \phi_{\alpha_1 + u}(x) - \phi_u(x).$$

First, note that since $\alpha_1 + u \ge u$ for all $u$, $0 \le \phi'_u(x) \le 1$ for all $u$ and
all $x$. This means that $\phi'_u$ is really a test. By design, $\beta_{\phi'_u}(\theta_1) = \alpha_1$. Also,
$\phi'_u$ has the form of $\psi$ for each $u$ (with $c_1$ or $c_2$ possibly infinite). This is
true since all $\phi_w(x)$ are 0 for small $x$ and 1 for large $x$. By construction,
$\phi'_0 = \phi_{\alpha_1}$ is the MP level $\alpha_1$ test of $H' : \Theta = \theta_1$ versus $A' : \Theta = \theta_2$, and
$\phi'_{1-\alpha_1} = 1 - \phi_{1-\alpha_1}$ is the least powerful such test. Since $\phi$ is also a level $\alpha_1$
test of $H'$ versus $A'$, it follows that

$$\beta_{\phi'_{1-\alpha_1}}(\theta_2) \le \alpha_2 \le \beta_{\phi'_0}(\theta_2).$$

If we can show that $\beta_{\phi'_w}(\theta_2)$ is continuous in $w$, we can conclude that there
exists $w$ such that $\beta_{\phi'_w}(\theta_2) = \alpha_2$. The proof of this is left to the reader.
(See Problem 40 on page 290.) It follows that $\phi'_w$ has the form of $\psi$ and
$\beta_{\phi'_w}(\theta_i) = \alpha_i$ for $i = 1, 2$. □



19 This corollary is used in the proof of Theorem 4.82. 

20 This lemma is used in the proofs of Theorem 4.82 and Lemma 4.99. It re- 
sembles part of Theorem 1 on p. 217 of Ferguson (1967). 






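Theorem 4.82. In a one-parameter exponential family with natural parameter,
if $\Omega_H = (-\infty, \theta_1] \cup [\theta_2, \infty)$ and $\Omega_A = (\theta_1, \theta_2)$, with $\theta_1 < \theta_2$, a test
of the form

$$
\phi_0(x) = \begin{cases}
1 & \text{if } c_1 < x < c_2, \\
\gamma_i & \text{if } x = c_i, \\
0 & \text{if } c_1 > x \text{ or } c_2 < x,
\end{cases}
$$

with $c_1 \le c_2$ minimizes $\beta_\phi(\theta)$ for all $\theta < \theta_1$ and for all $\theta > \theta_2$, and it
maximizes $\beta_\phi(\theta)$ for all $\theta_1 < \theta < \theta_2$ subject to $\beta_\phi(\theta_i) = \alpha_i$ for $i = 1, 2$ where
$\alpha_i = \beta_{\phi_0}(\theta_i)$ for $i = 1, 2$. If $c_1, c_2, \gamma_1, \gamma_2$ are chosen so that $\alpha_1 = \alpha_2 = \alpha$,
then $\phi_0$ is UMP level $\alpha$.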

PROOF. Suppose that $\phi_0$ is of the form stated in the theorem, and let
$f_{X|\Theta}(x|\theta) = c(\theta)\exp(\theta x)$, so that $h(x)$ is incorporated into the measure $\nu$.
Let $\theta_1$ and $\theta_2$ be as in the statement of the theorem, and let $\theta_0$ be another
element of $\Omega$. Define $p_i(x) = c(\theta_i)\exp(\theta_i x)$, for $i = 0, 1, 2$.

First, suppose that $\theta_1 < \theta_0 < \theta_2$. Set $b_i = \theta_i - \theta_0$ for $i = 1, 2$. Next,
note that the function $a_1\exp(b_1 x) + a_2\exp(b_2 x)$ is strictly monotone if $a_1$
and $a_2$ have opposite signs, and it is always negative if both $a_1$ and $a_2$
are negative. Suppose that we try to solve the following two equations for
$a_1, a_2$:

$$
\begin{aligned}
1 &= a_1\exp(b_1 c_1) + a_2\exp(b_2 c_1), \\
1 &= a_1\exp(b_1 c_2) + a_2\exp(b_2 c_2),
\end{aligned}
\tag{4.83}
$$

where $c_1$ and $c_2$ are from the definition of $\phi_0$. The reader can easily verify
that this system of linear equations has nonzero determinant and hence has
a solution. The solution must have both $a_1, a_2 > 0$. Set $k_i = a_i c(\theta_0)/c(\theta_i)$.
With these values of $k_1, k_2$, apply Lemma 4.78. Note that minimizing
$\int [1 - \phi(x)] p_0(x)\,d\nu(x)$ is equivalent to maximizing $\beta_\phi(\theta_0)$. The test that
maximizes $\beta_\phi(\theta_0)$ subject to $\beta_\phi(\theta_i) \le \beta_{\phi_0}(\theta_i)$ for $i = 1, 2$ is

$$\phi(x) = 1, \text{ if } c(\theta_0)\exp(\theta_0 x) > k_1 c(\theta_1)\exp(\theta_1 x) + k_2 c(\theta_2)\exp(\theta_2 x) \tag{4.84}$$

if this test satisfies the constraints. The test in (4.84) can be rewritten as

$$\phi(x) = 1 \text{ if } 1 > a_1\exp(b_1 x) + a_2\exp(b_2 x),$$

where $a_1$ and $a_2$ are the solutions to the two linear equations above. Since
$a_1\exp(b_1 x) + a_2\exp(b_2 x)$ goes to infinity as $x \to \pm\infty$, it follows that $\phi(x) =
1$ for $c_1 < x < c_2$, which leads to $\phi(x) = \phi_0(x)$. This same argument applies
for all $\theta_0$ between $\theta_1$ and $\theta_2$, hence the same $\phi_0$ maximizes $\beta_\phi(\theta)$ for all
$\theta_1 < \theta < \theta_2$.

Next, try to minimize $\beta_\phi(\theta_0)$ for $\theta_0 < \theta_1$. (An identical argument works
for $\theta_0 > \theta_2$.) This time, we will use Corollary 4.80. Set $b_i = \theta_i - \theta_0$ for
$i = 1, 2$. Next, note that the function $a_1\exp(b_1 x) + a_2\exp(b_2 x)$ is strictly
monotone if $a_1$ and $a_2$ have the same signs. If $a_1 < 0 < a_2$, the derivative
has only one zero and the function goes to 0 as $x \to -\infty$ and to $\infty$ as
$x \to \infty$. This means that the function equals 1 for only one value of $x$.
Hence, the solution to the equations in (4.83) must have $a_1 > 0 > a_2$.
Solve the equations and set $k_i = a_i c(\theta_0)/c(\theta_i)$. Since maximizing
$\int [1 - \phi(x)] p_0(x)\,d\nu(x)$ is the same as minimizing $\beta_\phi(\theta_0)$, the test that minimizes
$\beta_\phi(\theta_0)$ subject to $\beta_\phi(\theta_1) \ge \alpha_1$ and $\beta_\phi(\theta_2) \le \alpha_2$ is

$$\phi(x) = 1 \text{ if } c(\theta_0)\exp(\theta_0 x) < k_1 c(\theta_1)\exp(\theta_1 x) + k_2 c(\theta_2)\exp(\theta_2 x),$$

with $k_1 > 0$ and $k_2 < 0$. This can be rewritten as

$$\phi(x) = 1 \text{ if } 1 < a_1\exp(b_1 x) + a_2\exp(b_2 x),$$

with $a_1 > 0 > a_2$ and $b_2 > b_1 > 0$. Since $a_1\exp(b_1 x) + a_2\exp(b_2 x)$ goes
to 0 as $x \to -\infty$ and goes to $-\infty$ as $x \to \infty$, it follows that $\phi(x) = 1$ for
$c_1 < x < c_2$. Once again, we get the same test for all $\theta_0$ and the same test
as before.

Finally, consider the test $\phi_\alpha(x) \equiv \alpha$, and now suppose that $\alpha_1 = \alpha_2 = \alpha$.
Lemma 4.81 guarantees that $c_1, c_2, \gamma_1$, and $\gamma_2$ can be chosen so that $\phi_0$ has
the stated form with $\alpha_1 = \alpha_2 = \alpha$. The power function of $\phi_\alpha$ is the constant
$\alpha$. It must be that $\beta_{\phi_0}(\theta) \le \alpha$ for every $\theta \in \Omega_H$. Hence $\phi_0$ has level $\alpha$. Since
every level $\alpha$ test $\psi$ must satisfy $\beta_\psi(\theta_i) \le \alpha$ ($i = 1, 2$), and $\phi_0$ maximizes
the power on the alternative subject to these constraints, it follows that $\phi_0$ is
UMP level $\alpha$. □

Example 4.85. Suppose that $Y \sim Exp(\theta)$ given $\Theta = \theta$. Let $X = -Y$ so that $\Theta$
is the natural parameter. Let $\Omega_H = (0, 1] \cup [2, \infty)$, $\Omega_A = (1, 2)$, and $\alpha = 0.1$. We
must solve the equations

$$
\begin{aligned}
\exp(c_2) - \exp(c_1) &= 0.1, \\
\exp(2c_2) - \exp(2c_1) &= 0.1.
\end{aligned}
$$

If we let $a = \exp(c_2)$ and $b = \exp(c_1)$, these equations simplify to $a - b = 0.1$
and $a^2 - b^2 = 0.1$, respectively. The solution is easily calculated to be $a = 0.55$
and $b = 0.45$. So the solution to the original equations is $c_1 = -0.7985$ and
$c_2 = -0.5978$. Since the distribution is continuous, $\gamma_1 = \gamma_2 = 0$. We reject $H$ if
$0.7985 > Y > 0.5978$.
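As a numerical check of Example 4.85, the following sketch solves the two equations through the substitution $a = \exp(c_2)$, $b = \exp(c_1)$ and confirms that the power equals 0.1 at both $\theta = 1$ and $\theta = 2$. The function names are illustrative and do not come from the text.

```python
import math

def two_sided_cutoffs(alpha):
    """Solve exp(c2) - exp(c1) = alpha and exp(2 c2) - exp(2 c1) = alpha."""
    # With a = exp(c2), b = exp(c1): a - b = alpha and a^2 - b^2 = alpha,
    # so a + b = 1, giving a = (1 + alpha)/2 and b = (1 - alpha)/2.
    a = (1 + alpha) / 2
    b = (1 - alpha) / 2
    return math.log(b), math.log(a)

def power(theta, c1, c2):
    """P(c1 < X < c2) when X = -Y and Y ~ Exp(theta) with rate theta."""
    return math.exp(theta * c2) - math.exp(theta * c1)

c1, c2 = two_sided_cutoffs(0.1)
print(round(c1, 4), round(c2, 4))
print(power(1, c1, c2), power(2, c1, c2))
```

Because $a + b = 1$ whenever the two powers agree, the same shortcut works for any $\alpha \in (0, 1)$ in this particular family and pair of endpoints.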

Example 4.86. Suppose that $X \sim Bin(n, p)$ given $P = p$. Let $\Theta = \log(P/(1 -
P))$, the natural parameter. Then $f_{X|\Theta}(x|\theta) = c(\theta)\exp(\theta x)$, where $c(\theta) = (1 +
\exp(\theta))^{-n}$, and $\nu$ is $\binom{n}{x}$ times counting measure on $\{0, \ldots, n\}$. The hypothesis
$H : P \le 1/4$ or $P \ge 3/4$ corresponds to $\Omega_H = (-\infty, -1.099] \cup [1.099, \infty)$ in $\theta$
space. If $n = 10$ and $\alpha = 0.1$, we get the UMP level $\alpha$ test by choosing $c_1 = 4$
and $c_2 = 6$ with $\gamma_1 = \gamma_2 = 0.2565$.
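The randomization probability in Example 4.86 can be recovered with a few lines of arithmetic. The sketch below assumes, as the symmetry of the problem about $p = 1/2$ suggests, that the same value is used at both cutoffs; the helper names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability mass function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def common_gamma(n, p, c1, c2, alpha):
    """Randomization probability used at x = c1 and x = c2 so the power at p is alpha."""
    inner = sum(binom_pmf(x, n, p) for x in range(c1 + 1, c2))
    edge = binom_pmf(c1, n, p) + binom_pmf(c2, n, p)
    return (alpha - inner) / edge

def power(n, p, c1, c2, gamma):
    """Probability of rejecting H: reject on (c1, c2), randomize at the cutoffs."""
    inner = sum(binom_pmf(x, n, p) for x in range(c1 + 1, c2))
    edge = binom_pmf(c1, n, p) + binom_pmf(c2, n, p)
    return inner + gamma * edge

gamma = common_gamma(10, 0.25, 4, 6, 0.1)
print(round(gamma, 4))
print(power(10, 0.25, 4, 6, gamma), power(10, 0.75, 4, 6, gamma))
```

The power at $p = 3/4$ matches the power at $p = 1/4$ automatically because the rejection region is symmetric about $n/2$.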

Suppose that we have a prior distribution $\mu_\Theta$ on the parameter space.
If there is a test with finite Bayes risk and the loss function is bounded
below, then Theorem 4.82 allows us to conclude that tests of the form
given in that theorem are formal Bayes rules. (See Problem 38 on page 290.)
Of course, if we switch the names of hypothesis and alternative, then the
formal Bayes rule will be 1 minus the test in Theorem 4.82. This will be
even more apparent after we see Lemma 4.99. One interesting difference
between Bayes rules and UMP level $\alpha$ tests is that not all Bayes rules for
testing two-sided hypotheses need to have the same value for the power
function at the two endpoints of the alternative.

Example 4.87 (Continuation of Example 4.85; see page 250). Once again,
suppose that $Y \sim Exp(\theta)$ given $\Theta = \theta$. Suppose that the prior distribution for $\Theta$ is
$Exp(1)$. The posterior distribution will be $\Gamma(2, 1 + y)$. Let the loss function be
0-1-c with $c = 0.5$. Solving numerically for the formal Bayes rule, we get

$$
\phi(y) = \begin{cases}
1 & \text{if } 0.02133 < y < 0.83685, \\
0 & \text{otherwise.}
\end{cases}
$$

The power function of this test is 0.454 at $\theta = 1$ and 0.229 at $\theta = 2$. The level
of the test is 0.454, but it is not UMP level 0.454. The UMP level 0.454 test
would have power function 0.454 at $\theta = 2$, but would not be a Bayes rule for the
stated decision problem. The intuitive reason for the lopsided power function of
the formal Bayes rule is that the prior puts so much more mass below 1 than
above 2 (0.3679 versus 0.1353). It makes sense that the test should protect more
against alternatives with small $\theta$ than those with large $\theta$.
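The endpoints of the formal Bayes rule in the first part of Example 4.87 can be reproduced by root finding. The sketch below is only an illustration: it assumes that the 0-1-c loss charges $c$ for a false rejection and 1 for a false acceptance (so that $H$ is rejected when the posterior probability of the alternative exceeds $c/(1+c)$), and it uses a plain bisection routine written for this purpose.

```python
import math

def posterior_alt_prob(y):
    """Posterior P(1 < Theta < 2 | Y = y) when the posterior is Gamma(2, 1 + y)."""
    s = 1.0 + y
    # The Gamma(2, s) distribution function at t is 1 - exp(-s t)(1 + s t).
    cdf = lambda t: 1.0 - math.exp(-s * t) * (1.0 + s * t)
    return cdf(2.0) - cdf(1.0)

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = 0.5
threshold = c / (1.0 + c)  # assumed rejection threshold for P(alternative | y)
g = lambda y: posterior_alt_prob(y) - threshold
y_lo = bisect(g, 0.0, 0.4)  # brackets chosen by inspecting the sign of g
y_hi = bisect(g, 0.4, 2.0)
print(round(y_lo, 5), round(y_hi, 5))
```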

Curiously, however, if we use the improper prior with Radon-Nikodym derivative
$1/\theta$ with respect to Lebesgue measure, the formal Bayes rule will be a UMP
level $\alpha$ test for all 0-1-c losses. To see this, note that the posterior distribution
of $\Theta$ is $Exp(y)$. The posterior probability that the hypothesis is true is now
$1 - \exp(-y) + \exp(-2y)$. To find the formal Bayes rule with 0-1-c loss, we set
this expression equal to $1/(1 + c)$ and solve for $y$. There will be two solutions$^{21}$
$c_1 < c_2$ (the endpoints of the rejection region for the test), and they will satisfy

$$1 - \exp(-c_1) + \exp(-2c_1) = 1 - \exp(-c_2) + \exp(-2c_2) = \frac{1}{1 + c}.$$

Rearranging terms in this expression leads to

$$\exp(-c_1) - \exp(-c_2) = \exp(-2c_1) - \exp(-2c_2).$$

The right-hand side of this last equation is the power function of the test at $\theta = 2$,
and the left-hand side is the power function at $\theta = 1$. If $\alpha$ is the common value
of the two sides, then the test is UMP level $\alpha$.
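The equal-power property under the improper prior can be checked directly. In the sketch below, the substitution $t = \exp(-y)$ turns the equation for the cutoffs into a quadratic; the particular value of $c$ is a hypothetical choice, picked only so that the resulting test coincides with the level 0.1 test of Example 4.85.

```python
import math

def bayes_cutoffs(c):
    """Endpoints c1 < c2 solving 1 - exp(-y) + exp(-2y) = 1/(1 + c), or None."""
    # With t = exp(-y) the equation becomes t**2 - t + c/(1 + c) = 0.
    q = c / (1.0 + c)
    disc = 1.0 - 4.0 * q
    if disc <= 0.0:
        return None  # fewer than two solutions: the rule always accepts H
    t_small = (1.0 - math.sqrt(disc)) / 2.0
    t_large = (1.0 + math.sqrt(disc)) / 2.0
    return -math.log(t_large), -math.log(t_small)

def power(theta, c1, c2):
    """P(c1 < Y < c2) for Y ~ Exp(theta) with rate theta."""
    return math.exp(-theta * c1) - math.exp(-theta * c2)

c = 0.2475 / 0.7525  # illustrative loss constant, roughly 0.329
c1, c2 = bayes_cutoffs(c)
print(round(c1, 4), round(c2, 4))
print(power(1, c1, c2), power(2, c1, c2))
```

The two roots in $t$ always sum to 1, which is exactly why the powers at $\theta = 1$ and $\theta = 2$ agree for every admissible $c$.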

Theorem 4.68 says that the class of UMP level $\alpha$ tests is essentially
complete for decision problems that include hypothesis-testing loss functions
for one-sided hypotheses. The first part of Example 4.87 shows that for
two-sided hypotheses, the class of UMP level $\alpha$ tests is not essentially
complete. The formal Bayes rule given there is admissible and the risk function
is not the same as that of any UMP level $\alpha$ test. This, then, is the first point

21 There may actually be only one solution or no solutions, because the posterior
probability of the hypothesis is bounded below. In these cases, the formal Bayes
rule always accepts $H$.






at which classical hypothesis-testing theory has departed from the decision 
theoretic approach to hypothesis testing. When we had simple hypotheses 
and alternatives or one-sided hypotheses and alternatives with MLR, the 
class of (U)MP level $\alpha$ tests included (essentially) all admissible procedures.
When we move to two-sided hypotheses, we lose that property. One can 
prove, however (see Problem 37 on page 290), that the class of tests of the 
form given in Theorem 4.82 is essentially complete. The restriction to tests 
that are UMP for their level does not follow from considerations of admis- 
sibility. The argument to justify such a restriction might be, "In comparing 
two tests with the same level, I want to choose, if possible, the test with 
higher power function on the alternative." This would make perfect sense 
if the hypothesis consisted of a single point. (See Problem 16 on page 287.) 
However, the hypothesis is not a single point in general. The classical theory 
treats the hypothesis as if it were a single point and does not distinguish 
between tests based on their power functions on the hypothesis so long as 
they have the same level. To put the case more succinctly, the level of a 
test does not completely describe the power function on the hypothesis, 
but the classical theory pretends as if it did. That is why the formal Bayes 
rule in the first part of Example 4.87 on page 251 is lumped together with 
all level 0.454 tests even though it has advantages over some other level 
0.454 tests. These advantages are simply ignored when the level of a test 
is taken as the entire summary of the power function on the hypothesis. 

The restriction of attention to UMP level $\alpha$ tests, rather than all tests
of the form of Theorem 4.82, has another consequence that is even more 
surprising, perhaps, than the fact that the tests do not form an essentially 
complete class. A simple example will illustrate the general situation. 

Example 4.88. Let $X \sim N(\mu, 1)$ given $M = \mu$. Suppose that we are considering
two different hypotheses about $M$ with $\Omega_{H_1} = (-\infty, -0.5] \cup [0.5, \infty)$ and $\Omega_{H_2} =
(-\infty, -0.7] \cup [0.51, \infty)$. The UMP level 0.05 test of $H_1$ versus $A_1 : M \in (-0.5, 0.5)$
is to reject $H_1$ if $X \in (-0.071, 0.071)$. The UMP level 0.05 test of $H_2$ versus
$A_2 : M \in (-0.7, 0.51)$ is to reject $H_2$ if $X \in (-0.167, -0.017)$. Since $\Omega_{H_2} \subset \Omega_{H_1}$,
it makes sense that if we reject $H_1$, then a fortiori we should be able to reject
$H_2$. However, if $X \in [-0.017, 0.071)$, we would reject $H_1$ at level 0.05 but accept
$H_2$ at the same level.
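The rejection regions quoted in Example 4.88 can be checked against their stated level. The sketch below evaluates the power of each region at the endpoints of the corresponding hypothesis, using the standard normal distribution function written in terms of `math.erf`, and then exhibits one observation at which the two tests disagree. The helper names are illustrative.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(lo, hi, mu):
    """P(lo < X < hi) for X ~ N(mu, 1)."""
    return Phi(hi - mu) - Phi(lo - mu)

# Power of each rejection region at the endpoints of its hypothesis.
h1_powers = [power(-0.071, 0.071, m) for m in (-0.5, 0.5)]
h2_powers = [power(-0.167, -0.017, m) for m in (-0.7, 0.51)]
print(h1_powers, h2_powers)

# An observation in [-0.017, 0.071) rejects H1 but not H2.
x = 0.0
rejects_h1 = -0.071 < x < 0.071
rejects_h2 = -0.167 < x < -0.017
print(rejects_h1, rejects_h2)
```

All four powers come out close to 0.05; the small discrepancies reflect the three-decimal rounding of the cutoffs.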

The type of contradictory conclusions we were able to obtain in Example 4.88
actually occurs quite generally in level $\alpha$ testing, once we leave
the one-sided situation. (See also Problem 41 on page 290.) We will see
them again in Section 4.4.2 and in Section 4.6. Gabriel (1969) introduced
the concept of coherent tests of several hypotheses. A collection of tests of
various hypotheses is coherent if rejecting one hypothesis $H$ always leads
to rejecting every hypothesis that implies $H$. Testing several hypotheses at
the same level is not always coherent, as Example 4.88 shows. The problem
lies in choosing tests based on their level rather than on decision theoretic
criteria. For example, if one were to reject hypotheses whose posterior
probabilities were less than some number $\gamma$, then rejecting $H_1$ would always
lead to rejecting $H_2$ when $H_2$ implies $H_1$ (as in Example 4.88).

Example 4.89 (Continuation of Example 4.88; see page 252). Suppose that we
use the usual improper prior (Lebesgue measure). Then the posterior distribution
of $M$ is $N(x, 1)$. The level 0.05 test of $H_1$ corresponds to rejecting $H_1$ if the
posterior probability of $H_1$ is less than 0.618. The posterior probability of $H_2$ is
less than 0.618 whenever $x \in (-0.72, 0.535)$. Notice that this last interval strictly
contains the rejection region for $H_1$, $(-0.071, 0.071)$, so that rejection of $H_1$ will
always imply rejection of $H_2$ when rejection means "posterior probability less
than 0.618."
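The posterior threshold 0.618 and the interval $(-0.72, 0.535)$ in Example 4.89 follow from normal tail probabilities. A minimal sketch, assuming the flat-prior posterior $N(x, 1)$ stated in the example and using illustrative helper names:

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_prob_h(x, lower, upper):
    """P(M <= lower or M >= upper | X = x) when M given X = x is N(x, 1)."""
    return Phi(lower - x) + 1.0 - Phi(upper - x)

# Posterior probability of H1 at the edge of its classical rejection region.
p1 = posterior_prob_h(0.071, -0.5, 0.5)
# Posterior probability of H2 at the two ends of the interval (-0.72, 0.535).
p2_left = posterior_prob_h(-0.72, -0.7, 0.51)
p2_right = posterior_prob_h(0.535, -0.7, 0.51)
print(round(p1, 3), round(p2_left, 3), round(p2_right, 3))
```

Since the parameter set of $H_2$ is contained in that of $H_1$, the posterior probability of $H_2$ can never exceed that of $H_1$, which is the source of the coherence.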

There is a natural sense in which incoherent tests are inadmissible. See 
Problem 42 on page 291. 



4.4 Unbiased Tests 
4.4.1 General Results 

For the cases not previously considered, there generally do not exist UMP
level $\alpha$ tests. This is not to say that there are no good tests in other
situations, but rather that the criterion of UMP level $\alpha$ needs to be relaxed
if we are going to find the good tests. Consider the following example, which
is typical of what happens when the alternative is two-sided.

Example 4.90. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and that we wish to test
$H : \Theta = \theta_0$ versus $A : \Theta \neq \theta_0$ at some level $\alpha \in (0, 1)$. The UMP level $\alpha$ test $\phi'$ of
$H$ versus $A' : \Theta < \theta_0$ and the UMP level $\alpha$ test $\phi''$ of $H$ versus $A'' : \Theta > \theta_0$ are
both level $\alpha$ tests of $H$ versus $A$. If there is a UMP level $\alpha$ test $\phi$ of $H$ versus $A$,
then its power function must be at least as large as that of $\phi'$ for $\theta < \theta_0$ and at
least as large as that of $\phi''$ for $\theta > \theta_0$. But such a $\phi$ would also be a level $\alpha$ test of
$H$ versus $A'$, so Proposition 4.67 says that either $\phi = \phi'$ a.s.$^{22}$ or there is $\theta < \theta_0$
such that $\beta_\phi(\theta) < \beta_{\phi'}(\theta)$. Since we are assuming that the latter is false, we must
have $\phi = \phi'$ a.s. The same argument applied to $A''$ and $\phi''$ implies that $\phi = \phi''$
a.s. $[P_\theta]$ for all $\theta$. But this is impossible since $\phi'(x) = I_{(-\infty, c']}(x)$, for some finite
number $c'$, and $\phi''(x) = I_{[c'', \infty)}(x)$, for some finite number $c''$. It follows that no
test $\phi$ is UMP level $\alpha$ for testing $H$ versus $A$.

The way to circumvent the lack of UMP level $\alpha$ tests in cases like
Example 4.90 is to create a new criterion that one-sided tests fail to satisfy
when the alternative is two-sided.$^{23}$ The rationale is that even though the
power function of a one-sided test is high in one part of the alternative, it
is very low in the other part. The new optimality criterion requires that
the power function be higher on the alternative than on the hypothesis.



22 Since all $N(\theta, 1)$ distributions are mutually absolutely continuous, a.s. with
respect to one of them means a.s. with respect to all of them.

23 When the conditions of Proposition 4.67 fail, there may be UMP level $\alpha$ tests
for two-sided alternatives. See Problem 27 on page 288 for an example that even
has MLR.






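Definition 4.91. A test $\phi$ is unbiased level $\alpha$ if it has level $\alpha$ and if $\beta_\phi(\theta) \ge
\alpha$ for all $\theta \in \Omega_A$. If $\Omega \subseteq \mathbb{R}^k$, a test $\phi$ is called $\alpha$-similar if $\beta_\phi(\theta) = \alpha$ for
each $\theta \in \overline{\Omega}_H \cap \overline{\Omega}_A$. More generally, $\phi$ is $\alpha$-similar on $B \subseteq \Omega$ if $\beta_\phi(\theta) = \alpha$
for each $\theta \in B$. If $\phi$ is UMP among all unbiased level $\alpha$ tests, then $\phi$ is
uniformly most powerful unbiased (UMPU) level $\alpha$.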

The concepts of unbiased level $\alpha$ and $\alpha$-similar are closely related.

Proposition 4.92.$^{24}$ If a test $\phi$ is unbiased level $\alpha$ and $\beta_\phi(\cdot)$ is continuous,
then $\phi$ is $\alpha$-similar.

Proposition 4.93. If $\phi$ is a UMP level $\alpha$ test, then $\phi$ is unbiased level $\alpha$.

Since $\phi$ being unbiased level $\alpha$ implies that $\phi$ has floor $\alpha$, the dual concept
to unbiased level $\alpha$ is simply unbiased floor $\alpha$.

Definition 4.94. A test $\phi$ is unbiased floor $\alpha$ if it has floor $\alpha$ and if $\beta_\phi(\theta) \le
\alpha$ for all $\theta \in \Omega_H$. If $\phi$ is UMC among all unbiased floor $\alpha$ tests, then $\phi$ is
uniformly most cautious unbiased (UMCU) floor $\alpha$.

It is interesting to note that the collection of unbiased tests may not be 
essentially complete. The test in the first part of Example 4.87 on page 251 
is admissible with 0-1-0.5 loss, but it is not unbiased and it does not have 
the same risk function as an unbiased test. The restriction to unbiased 
tests, just like the restriction to UMP level $\alpha$ tests in the previous section,
does not follow from considerations of admissibility. It is true that the 
restriction to unbiased tests rules out the use of one-sided tests in problems 
like Example 4.90 on page 253, but it also rules out many admissible tests. 

Example 4.95. Suppose that $Y \sim Exp(\theta)$ given $\Theta = \theta$ with $\Omega_H = [1, 2]$ and
$\Omega_A = (-\infty, 1) \cup (2, \infty)$. Suppose that the loss function is asymmetric in the
following way.

$$
L(\theta, a) = \begin{cases}
3 & \text{if } \theta \in \Omega_H \text{ and } a = 1, \\
\frac{1}{2} & \text{if } \theta > 2 \text{ and } a = 0, \\
1 & \text{if } \theta < 1 \text{ and } a = 0, \\
0 & \text{otherwise.}
\end{cases}
$$

We will use the usual improper prior with Radon-Nikodym derivative $1/\theta$ with
respect to Lebesgue measure so that the posterior distribution is $Exp(y)$. The
formal Bayes rule will minimize the posterior risk. The posterior risks for the two
possible decisions are

$$
\begin{array}{ll}
a = 0: & \frac{1}{2}\exp(-2y) + 1 - \exp(-y), \\
a = 1: & 3(\exp(-y) - \exp(-2y)).
\end{array}
$$

Solving to see when the risk for $a = 1$ is smaller, we see that this occurs when
$y < 0.2569$ or $y > 0.9959$. The test that rejects when one of these conditions
holds has power function 0.5959 at $\theta = 1$ and 0.5382 at $\theta = 2$. Since it is more
important to reject $H$ when $\theta$ is small, the power is higher for small $\theta$ values.
The test has level 0.5959, but it is biased.

24 This proposition is used in the proofs of Lemma 4.96 and of Theorems 4.123
and 4.124.
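The cutoffs and power values in Example 4.95 can be reproduced in closed form. The sketch below assumes that the loss for accepting $H$ when $\theta > 2$ is $1/2$; that value is inferred here from the reported cutoffs and should be read as an assumption. With $t = \exp(-y)$, rejecting has the smaller posterior risk exactly when $3.5t^2 - 4t + 1 > 0$.

```python
import math

# Posterior risks with t = exp(-y) and posterior Exp(y):
#   a = 0: 0.5 * t**2 + (1 - t)   (0.5 is the assumed loss for theta > 2)
#   a = 1: 3 * (t - t**2)
# Rejecting is better when 3.5 t^2 - 4 t + 1 > 0, i.e. outside the two roots.
root = math.sqrt(4.0**2 - 4.0 * 3.5 * 1.0)
t_lo = (4.0 - root) / 7.0
t_hi = (4.0 + root) / 7.0
y_lo = -math.log(t_hi)  # reject when y < y_lo
y_hi = -math.log(t_lo)  # or when y > y_hi

def power(theta):
    """P(Y < y_lo or Y > y_hi) for Y ~ Exp(theta) with rate theta."""
    return 1.0 - math.exp(-theta * y_lo) + math.exp(-theta * y_hi)

print(round(y_lo, 4), round(y_hi, 4))
print(round(power(1.0), 4), round(power(2.0), 4))
```

The power at $\theta = 2$ falls below the power at $\theta = 1$, which is the bias noted in the example.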

One technique for finding UMPU tests will be to restrict attention first
to $\alpha$-similar tests. The following lemma shows why this will work in many
cases.

Lemma 4.96.$^{25}$ Suppose that $\beta_\phi(\cdot)$ is continuous in $\theta$ for every $\phi$. If $\phi_0$
is UMP among $\alpha$-similar tests and has level $\alpha$, then $\phi_0$ is UMPU level $\alpha$.

Proof. Since $\psi(x) \equiv \alpha$ is $\alpha$-similar and $\phi_0$ is UMP $\alpha$-similar, it follows
that $\beta_{\phi_0}(\theta) \ge \alpha$ for every $\theta \in \Omega_A$. Since $\phi_0$ has level $\alpha$, it is unbiased level
$\alpha$. Every unbiased level $\alpha$ test is $\alpha$-similar by Proposition 4.92. So, the test
that is UMP among $\alpha$-similar tests, namely $\phi_0$, has power function at least
as high (on the alternative) as the test that is UMPU level $\alpha$. But $\phi_0$ is
also unbiased level $\alpha$. Hence $\phi_0$ is UMPU level $\alpha$. □

Proposition 4.97.$^{26}$ Suppose that $\beta_\phi(\cdot)$ is continuous in $\theta$ for every $\phi$. If
$\phi_0$ is UMC among $\alpha$-similar tests and $\phi_0$ has base $\alpha$, then $\phi_0$ is UMCU
floor $\alpha$.



4.4.2 Interval Hypotheses 

In this section, we will consider the case in which the alternative is
two-sided and the hypothesis is a nondegenerate compact interval. That is, the
case $H : \Theta \in [\theta_1, \theta_2]$ versus $A : \Theta \notin [\theta_1, \theta_2]$, with $\theta_1 < \theta_2$. It turns out that
there is no UMP level $\alpha$ test of $H$ versus $A$ in one-parameter exponential
families (for $\alpha > 0$). One would suspect that if $\phi$ were the optimal level
$1 - \alpha$ test of $A$ versus $H$ as derived in Theorem 4.82, then $1 - \phi$ would be
the appropriate level $\alpha$ test for $H$ versus $A$.$^{27}$ This will, in fact, turn out to
be the case, but the test is no longer UMP level $\alpha$, but only UMPU level
$\alpha$.$^{28}$

Example 4.98 (Continuation of Example 4.87; see page 251). In this example,
$Y \sim Exp(\theta)$ given $\Theta = \theta$. For consistency with the classical approach, suppose
that we use the improper prior with Radon-Nikodym derivative $1/\theta$ with respect
to Lebesgue measure. The posterior distribution of $\Theta$ given $Y = y$ is $Exp(y)$.
At the end of Example 4.87 on page 251, we saw that the formal Bayes rule
with 0-1-c loss would be a UMP level $\alpha$ test. Suppose now that we switch the
hypothesis and alternative, so that $\Omega_H = [1, 2]$ and $\Omega_A = (-\infty, 1) \cup (2, \infty)$.
Suppose that we use the 0-1-c loss with $c = 3.04$. The posterior probability
of $\Omega_H$ is $\exp(-y) - \exp(-2y)$. We then reject $H$, that is, we choose $a = 1$, if
$\exp(-y) - \exp(-2y) < 1/4.04$, which is true if $y > 0.7985$ or $y < 0.5978$. This is
the same (a.s.) as 1 minus the UMP level 0.1 test in the earlier example. Note
that the conditions of Proposition 4.67 are met in this example, and so no test
will be UMP level 0.9. The test we have just constructed will be UMPU level 0.9,
however.

25 This lemma is used in the proof of Theorem 4.100.
26 This proposition is used in the proof of Theorem 4.100.
27 One can prove (see Problem 30 on page 289) that $1 - \phi$ is UMC floor $\alpha$ for
testing $H$ versus $A$. But this is not the most popular optimality criterion.
28 One should also be aware that there is no UMC floor $\alpha$ test in the case of
two-sided hypotheses in exponential families. The concepts of UMP and UMC really
are dual to each other. Neither of them is the unique best optimality criterion.

We are now in position to begin to prove that the UMPU level $\alpha$ test for
the case of two-sided alternatives in a one-parameter exponential family is
just 1 minus the UMP level $1 - \alpha$ test for the two-sided hypothesis.

Lemma 4.99.$^{29}$ In a one-parameter exponential family with natural parameter,
if $\phi$ is any test of $H : \Theta \in [\theta_1, \theta_2]$ versus $A : \Theta \notin [\theta_1, \theta_2]$ with
$\theta_1 < \theta_2$, then there is a test of the form

$$
\psi(x) = \begin{cases}
1 & \text{if } x < c_1 \text{ or } x > c_2, \\
\gamma_i & \text{if } x = c_i, \\
0 & \text{if } c_1 < x < c_2,
\end{cases}
$$

with $\beta_\psi(\theta_i) = \beta_\phi(\theta_i)$ for $i = 1, 2$ and

$$
\beta_\psi(\theta) \begin{cases}
\le \beta_\phi(\theta) & \text{if } \theta \in (\theta_1, \theta_2), \\
\ge \beta_\phi(\theta) & \text{if } \theta \notin [\theta_1, \theta_2].
\end{cases}
$$

Proof. Lemma 4.81 says that $1 - \psi$ can be chosen to have $\beta_{1-\psi}(\theta_i) =
\beta_{1-\phi}(\theta_i)$ and thus that $\psi$ is in the desired form. Theorem 4.82 then shows
that $1 - \psi$ minimizes and maximizes power in just the opposite regions
from where we want $\psi$ to minimize and maximize power under the same
conditions. □

The tests in Lemma 4.99 are called two-sided tests. It is easy to see that
when the conditions of Lemma 4.99 hold, the class of two-sided tests is
essentially complete for hypothesis-testing loss functions. (See Problem 48
on page 292.) The next theorem says that the UMPU level $\alpha$ tests are a
subset of this essentially complete class. We could show, as in Example 4.87
on page 251, however, that this subset is not essentially complete.

Theorem 4.100. Assume the same conditions as Lemma 4.99. Also suppose
that $\beta_\psi(\theta_i) = \alpha$ for $i = 1, 2$. A test of the form $\psi$ is UMPU level $\alpha$
and UMCU floor $\alpha$.

Proof. By comparing $\psi$ with $\phi_\alpha(x) \equiv \alpha$ and using Lemma 4.99, we see
that $\psi$ is unbiased level $\alpha$ and unbiased floor $\alpha$. Also, Lemma 4.99 shows
that $\psi$ is UMP $\alpha$-similar and UMC $\alpha$-similar. Lemma 4.96 and Proposition 4.97
can be applied since the power functions are continuous in an
exponential family. □

29 This lemma is used in the proof of Theorem 4.100 and to show that the class
of two-sided tests is essentially complete.





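Example 4.101. Suppose that $X \sim N(\mu, 1)$ given $\Theta = \mu$. Let $\Omega_H = [-1, 1]$ and $\alpha = 0.1$. Set $c_2 = 2.286$ and $c_1 = -2.286$. $\theta_1 = -1$ and $\theta_2 = 1$. So
\[
\psi(x) = \begin{cases} 1 & \text{if } x < -2.286 \text{ or } x > 2.286, \\ 0 & \text{otherwise;} \end{cases}
\]
\[
\begin{aligned}
\beta_\psi(\theta_2) &= \Pr(N(1, 1) \notin [-2.286, 2.286]) \\
&= 1 - (\Phi(1.286) - \Phi(-3.286)) = 0.1 \\
&= 1 - (\Phi(3.286) - \Phi(-1.286)) \\
&= \Pr(N(-1, 1) \notin [-2.286, 2.286]) = \beta_\psi(\theta_1).
\end{aligned}
\]
Hence, $\psi$ is UMPU level 0.1 and UMCU floor 0.1.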
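The two power values in Example 4.101 are easy to recompute. The following is a minimal sketch using only the Python standard library; the helper names `norm_cdf` and `two_sided_power` are illustrative choices, not notation from the text.

```python
from math import erfc, sqrt

def norm_cdf(z):
    """Standard normal CDF, Phi(z), via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))

def two_sided_power(theta, c1, c2):
    """Pr(X < c1 or X > c2) when X ~ N(theta, 1)."""
    return norm_cdf(c1 - theta) + 1.0 - norm_cdf(c2 - theta)

# Power at the two endpoints of the hypothesis and at its midpoint.
power_at_minus1 = two_sided_power(-1.0, -2.286, 2.286)
power_at_plus1 = two_sided_power(1.0, -2.286, 2.286)
power_at_zero = two_sided_power(0.0, -2.286, 2.286)
print(power_at_minus1, power_at_plus1, power_at_zero)
```

The power at both endpoints should be close to 0.1, and the power inside the hypothesis (for instance at 0) should be smaller, as unbiasedness requires.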

Lest the reader think that UMPU and UMCU tests are always the same, 
note that the UMP and UMC tests in Problem 31 on page 289 are both 
unbiased, but they are not the same. Those tests are one-sided however. 

We should also note that the type of contradictory conclusions drawn in 
Example 4.88 on page 252 also arises for interval hypotheses. The following 
example is adapted from Schervish (1996). 

Example 4.102. Suppose that $X \sim N(\mu, 1)$ given $M = \mu$. We wish to consider two different hypotheses, $H_1 : M \in [-0.5, 0.5]$ versus $A_1 : M \notin [-0.5, 0.5]$ and $H_2 : M \in [-0.82, 0.52]$ versus $A_2 : M \notin [-0.82, 0.52]$. The UMPU level 0.05 test of $H_1$ is to reject $H_1$ if $X \notin [-2.185, 2.185]$. The UMPU level 0.05 test of $H_2$ is to reject $H_2$ if $X \notin [-2.475, 2.175]$. So, if $X \in (2.175, 2.185]$, we would reject $H_2$ and accept $H_1$, even though $H_1$ implies $H_2$.

If we had used Lebesgue measure for an improper prior, the posterior probability of $H_1$ given $X = x$ is less than 0.0424 if $x \notin [-2.185, 2.185]$, the rejection region for the UMPU level 0.05 test. The posterior probability of $H_2$ given $X = x$ is less than 0.0424 if $x \notin [-2.531, 2.231]$. So, if the decision rule is to reject the hypothesis if the posterior probability is less than 0.0424, then we would reject $H_1$ whenever we reject $H_2$.
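The numbers in Example 4.102 can be checked directly. The sketch below assumes, as the example states, a flat (Lebesgue) prior, so that the posterior of $M$ given $X = x$ is $N(x, 1)$; the function names are illustrative only.

```python
from math import erfc, sqrt

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * erfc(-z / sqrt(2.0))

def reject_prob(mu, lo, hi):
    """Pr(X outside [lo, hi]) when X ~ N(mu, 1)."""
    return norm_cdf(lo - mu) + 1.0 - norm_cdf(hi - mu)

def posterior_prob(x, a, b):
    """Pr(a <= M <= b | X = x) under a flat prior, where M | X = x ~ N(x, 1)."""
    return norm_cdf(b - x) - norm_cdf(a - x)

# Sizes of the two tests at the endpoints of their hypotheses.
size_h1 = reject_prob(0.5, -2.185, 2.185)
size_h2_left = reject_prob(-0.82, -2.475, 2.175)
size_h2_right = reject_prob(0.52, -2.475, 2.175)

# Posterior probabilities at the right-hand boundaries of the two regions.
post_h1 = posterior_prob(2.185, -0.5, 0.5)
post_h2 = posterior_prob(2.231, -0.82, 0.52)
print(size_h1, size_h2_left, size_h2_right, post_h1, post_h2)
```

Both tests have size close to 0.05 at the endpoints, and both posterior probabilities are close to 0.0424 at the stated boundaries. Because the posterior-based regions are nested while the UMPU regions are not, only the UMPU tests produce the contradictory conclusion.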

4.4.3 Point Hypotheses 

In this section, we will deal with the case in which $\Omega_H = \{\theta_0\}$ and $\Omega_A = \Omega \setminus \{\theta_0\}$. This is like the case of an interval hypothesis with a two-sided alternative, except that the interval is degenerate. The proofs of some of the results for two-sided alternatives relied on the fact that the two endpoints of the hypothesis were distinct. When the endpoints are the same, some changes are required.

Lemma 4.103.$^{30}$ Suppose that the power functions of all tests are differentiable. If $\phi$ is the UMP level $\alpha$ test of $H : \Theta = \theta_0$ versus $A : \Theta < \theta_0$, then the derivative of the power function of $\phi$ at $\theta_0$ is smallest among all tests with size $\alpha$. Similarly, if $\phi$ is the UMP level $\alpha$ test of $H : \Theta = \theta_0$ versus $A : \Theta > \theta_0$, then the derivative of the power function of $\phi$ at $\theta_0$ is largest among all tests with size $\alpha$.




30 This lemma is used in the proof of Theorem 4.104.






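Proof. We prove only the first part, since the second is very similar. By Lemma 4.45, it follows that $\beta_\phi(\theta_0) = \alpha$ if $\phi$ is UMP level $\alpha$. Let $\psi$ be another size $\alpha$ test. Since $\phi$ is UMP level $\alpha$, for every $\epsilon > 0$,
\[
\begin{aligned}
\beta_\phi(\theta_0 - \epsilon) &\ge \beta_\psi(\theta_0 - \epsilon), \\
\frac{\alpha - \beta_\phi(\theta_0 - \epsilon)}{\epsilon} &\le \frac{\alpha - \beta_\psi(\theta_0 - \epsilon)}{\epsilon}.
\end{aligned}
\]
Since the derivatives are the limits of the quantities in the last inequality as $\epsilon$ goes to 0, the result follows. □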

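Theorem 4.104.$^{31}$ In a one-parameter exponential family with natural parameter, let $\Omega_H = \{\theta_0\}$, where $\theta_0$ is in the interior of $\Omega$. Let $\phi$ be any test of $H$ versus $A : \Theta \ne \theta_0$. Then there is a test $\psi$ of the form of $\psi$ in Lemma 4.99 such that
\[
\beta_\psi(\theta_0) = \beta_\phi(\theta_0), \qquad
\left.\frac{d}{d\theta}\beta_\psi(\theta)\right|_{\theta=\theta_0} = \left.\frac{d}{d\theta}\beta_\phi(\theta)\right|_{\theta=\theta_0}, \tag{4.105}
\]
and, for every $\theta \ne \theta_0$, $\beta_\psi(\theta)$ is maximized among all tests $\psi$ satisfying the two equalities above.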

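Proof. Let $\alpha = \beta_\phi(\theta_0)$ and $\gamma = d\beta_\phi(\theta)/d\theta|_{\theta=\theta_0}$. Let $\phi_w$ be the UMP level $w$ test of $H : \Theta \ge \theta_0$ versus $A : \Theta < \theta_0$, and for each $0 \le u \le \alpha$, set
\[
\phi'_u(x) = \phi_u(x) + 1 - \phi_{1-\alpha+u}(x).
\]
By design, $\beta_{\phi'_u}(\theta_0) = \alpha$ for all $u$. Also, $\phi'_u$ has the form of $\psi$ for every $u$ (with $c_1$ or $c_2$ possibly infinite). By construction, $\phi'_0 = 1 - \phi_{1-\alpha}$ is the UMP level $\alpha$ test of $H' : \Theta = \theta_0$ versus $A' : \Theta > \theta_0$ and $\phi'_\alpha = \phi_\alpha$ is the least powerful such test. It follows from Lemma 4.103 that the derivatives of the power functions of $\phi'_\alpha$ and $\phi'_0$ at $\theta_0$ are respectively the smallest possible and the largest possible among all tests with power function $\alpha$ at $\theta_0$. Hence
\[
\left.\frac{d}{d\theta}\beta_{\phi'_\alpha}(\theta)\right|_{\theta=\theta_0} \le \gamma \le \left.\frac{d}{d\theta}\beta_{\phi'_0}(\theta)\right|_{\theta=\theta_0}.
\]
To prove that there is a $\psi$ satisfying (4.105), we need only show that $d\beta_{\phi_w}(\theta)/d\theta|_{\theta=\theta_0}$ is continuous in $w$.$^{32}$ Recall that
\[
\phi_w(x) = \begin{cases} 1 & \text{if } x < c_w, \\ \gamma_w & \text{if } x = c_w, \\ 0 & \text{if } x > c_w, \end{cases}
\]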


31 This theorem is used in the proofs of Corollary 4.109 and Theorem 4.124. It 
is also used to show that two-sided tests form an essentially complete class. 

32 The proof follows part of the proof of Theorem 2 on pp. 220-221 of Ferguson 
(1967). 






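for some numbers $c_w$ and $\gamma_w$ such that
\[
\beta_{\phi_w}(\theta_0) = P_{\theta_0}(X < c_w) + \gamma_w P_{\theta_0}(X = c_w) = w. \tag{4.106}
\]
Define $h(x, g) = P_{\theta_0}(X < x) + g P_{\theta_0}(X = x)$, and define the random variable $V = h(X, G)$, where $G$ has $U(0, 1)$ distribution and is independent of $X$ and $\Theta$. For $0 < w \le 1$, we note that
\[
c_w = \inf\{u : F_{X|\Theta}(u|\theta_0) \ge w\}
\]
and
\[
\gamma_w P_{\theta_0}(X = c_w) = [w - P_{\theta_0}(X < c_w)].
\]
For $w = 0$, $c_0 = \sup\{u : F_{X|\Theta}(u|\theta_0) = 0\}$ and $\gamma_0 = 0$. It follows that for all $t$, $h(x, g) \le t$ if and only if either $x < c_t$ or $x = c_t$ and $g \le \gamma_t$. For $t > 0$, we have
\[
\begin{aligned}
F_{V|\Theta}(t|\theta_0) &= \int\!\!\int I_{[0,t]}(h(x, g)) f_{X|\Theta}(x|\theta_0)\, dg\, d\nu(x) \\
&= \int\!\!\int [I_{(-\infty,c_t)}(x) + I_{\{c_t\}}(x) I_{[0,\gamma_t]}(g)]\, dg\, f_{X|\Theta}(x|\theta_0)\, d\nu(x) \\
&= \int [I_{(-\infty,c_t)}(x) + \gamma_t I_{\{c_t\}}(x)] f_{X|\Theta}(x|\theta_0)\, d\nu(x) \qquad (4.107) \\
&= \int \phi_t(x) f_{X|\Theta}(x|\theta_0)\, d\nu(x) = \beta_{\phi_t}(\theta_0) = t,
\end{aligned}
\]
by (4.106). Hence $V$ has $U(0, 1)$ distribution given $\Theta = \theta_0$. From (4.107), we can write $\phi_w(x) = E\{I_{[0,w]}(V) | X = x\}$. It follows from Theorem 2.64 that for every test $\eta$,
\[
\begin{aligned}
\left.\frac{d}{d\theta}\beta_\eta(\theta)\right|_{\theta=\theta_0}
&= \left.\frac{d}{d\theta}\int \eta(x) c(\theta) \exp(\theta x)\, d\nu(x)\right|_{\theta=\theta_0} \\
&= \left.\int \eta(x) \frac{d}{d\theta}\, c(\theta) \exp(\theta x)\, d\nu(x)\right|_{\theta=\theta_0} \\
&= \left.\int \eta(x) [x c(\theta) + c'(\theta)] \exp(\theta x)\, d\nu(x)\right|_{\theta=\theta_0} \\
&= E_{\theta_0}[X\eta(X)] - \beta_\eta(\theta_0) E_{\theta_0}(X).
\end{aligned}
\]
It follows that
\[
\begin{aligned}
\left.\frac{d}{d\theta}\beta_{\phi_w}(\theta)\right|_{\theta=\theta_0}
&= E_{\theta_0}\{X\phi_w(X)\} - w E_{\theta_0}(X) \\
&= E_{\theta_0}\{X I_{[0,w]}(V)\} - w E_{\theta_0}(X).
\end{aligned}
\]
Since $w$ is continuous and $V$ has continuous distribution, it follows that the above expression is continuous in $w$.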






What remains to be proven is that $\psi$ maximizes the power function among all tests with a fixed size and a fixed value of the derivative of the power function. As in the proof of Theorem 4.82, let $f_{X|\Theta}(x|\theta) = c(\theta)\exp(\theta x)$, so that $h(x)$ is incorporated into the measure $\nu$. For a test $\eta$ with $\beta_\eta(\theta_0) = \alpha$, the derivative of the power function at $\theta_0$ will equal $\gamma$ if and only if $E_{\theta_0}[X\eta(X)] = \gamma + \alpha E_{\theta_0}(X)$. Let $\theta_1 \ne \theta_0$. We now apply Lemma 4.78 with
\[
\begin{aligned}
p_0(x) &= c(\theta_1)\exp(\theta_1 x), \\
p_1(x) &= c(\theta_0)\exp(\theta_0 x), \\
p_2(x) &= x c(\theta_0)\exp(\theta_0 x).
\end{aligned}
\]
The test $\eta$ with the largest power at $\theta_1$ subject to
\[
\beta_\eta(\theta_0) \le (\ge)\ \alpha, \qquad
\left.\frac{d}{d\theta}\beta_\eta(\theta)\right|_{\theta=\theta_0} \le (\ge)\ \gamma
\]
is $\psi_0$, with $\psi_0(x) = 1$ if
\[
\exp(\theta_1 x) > k_1 \exp(\theta_0 x) + k_2 x \exp(\theta_0 x), \tag{4.108}
\]
where the signs of $k_1$ and $k_2$ depend on which inequalities we use. The inequality (4.108) simplifies to $\exp([\theta_1 - \theta_0]x) > k_1 + k_2 x$. This inequality is satisfied for $x$ outside of a bounded interval or for $x$ in a semi-infinite interval. We already know (Theorem 4.56 and Propositions 4.62-4.64) that tests with $\psi_0(x) = 1$ for $x$ in a semi-infinite interval are one-sided and they minimize the power function on one side of the hypothesis. Hence, we need $\psi_0(x) = 1$ for $x$ outside of a bounded interval, and $\psi_0$ has the form of $\psi$. Furthermore, the same $\psi_0$ works whether $\theta_1 > \theta_0$ or $\theta_1 < \theta_0$, by choosing $k_1$ and $k_2$ correctly. □
When the conditions of Theorem 4.104 hold and the loss function is 
of the hypothesis-testing type, then it follows that the class of two-sided 
tests is essentially complete. Corollary 4.109 says that the class of UMPU 
level a tests is a subset of this essentially complete class. At the end of 
Example 4.111 on page 261, we will see that the class of UMPU tests is 
not essentially complete. 

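Corollary 4.109.$^{33}$ In a one-parameter exponential family with natural parameter, let $\Omega_H = \{\theta_0\}$, where $\theta_0$ is in the interior of $\Omega$. If $\psi$ is a size $\alpha$ test of the form of Lemma 4.99 with
\[
\left.\frac{d}{d\theta}\beta_\psi(\theta)\right|_{\theta=\theta_0} = 0,
\]
then it is UMPU level $\alpha$.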



33 This corollary is used in the proof of Theorem 4.124. 






Proof. Since the test $\phi_\alpha(x) \equiv \alpha$ has size $\alpha$ and 0 derivative at $\theta_0$, Theorem 4.104 tells us that $\psi$ is unbiased level $\alpha$. In light of Theorem 4.104, all we need to show is that all unbiased level $\alpha$ tests must have power functions with 0 derivative at $\theta_0$. Any test $\phi$ with $\beta_\phi(\theta_0) = \alpha$ but with nonzero derivative will have power strictly less than $\alpha$ on one side or the other of $\theta_0$ because the power function is differentiable. Such a test could not be unbiased level $\alpha$. □

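Example 4.110. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and that we wish to test $H : \Theta = \theta_0$ versus $A : \Theta \ne \theta_0$. To make the test $\psi$ unbiased, we need
\[
0 = \left.\frac{d}{d\theta}\beta_\psi(\theta)\right|_{\theta=\theta_0}
= \frac{1}{\sqrt{2\pi}}\left[\exp\left(-\frac{(c_2 - \theta_0)^2}{2}\right) - \exp\left(-\frac{(c_1 - \theta_0)^2}{2}\right)\right],
\]
which is true if and only if $-(c_1 - \theta_0) = c_2 - \theta_0 = c$. In this case $\beta_\psi(\theta_0) = 2[1 - \Phi(c)] = \alpha$ if and only if $c = \Phi^{-1}(1 - \alpha/2)$. This gives the usual equal-tailed, two-sided test, which is UMPU level $\alpha$.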
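The cutoff in Example 4.110 and the zero-derivative property can be illustrated numerically. This is a sketch only: it uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later), and the values $\alpha = 0.05$ and $\theta_0 = 1$ are arbitrary choices made for illustration.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal distribution

def equal_tailed_cutoff(alpha):
    """c = Phi^{-1}(1 - alpha/2)."""
    return std.inv_cdf(1.0 - alpha / 2.0)

def power(theta, theta0, c):
    """Power of the test that rejects when |X - theta0| > c, with X ~ N(theta, 1)."""
    return std.cdf(theta0 - c - theta) + 1.0 - std.cdf(theta0 + c - theta)

alpha, theta0 = 0.05, 1.0          # illustrative values, not from the text
c = equal_tailed_cutoff(alpha)

# Central-difference approximation to the derivative of the power at theta0.
eps = 1e-5
slope = (power(theta0 + eps, theta0, c) - power(theta0 - eps, theta0, c)) / (2 * eps)
size = power(theta0, theta0, c)
power_away = power(theta0 + 0.5, theta0, c)
print(c, size, slope, power_away)
```

The size equals $\alpha$, the slope at $\theta_0$ is numerically zero, and the power away from $\theta_0$ exceeds $\alpha$.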

In Example 4.110, if a Bayesian used a proper continuous prior distribution, then $\Pr(\Theta = \theta_0) = 0$ both before and after observing $X$. There are at
least two ways to treat point hypotheses from a Bayesian perspective. One 
is to treat them as surrogates for interval hypotheses in which the length of 
the interval has not been stated. Another is to assign positive probability 
to the point hypothesis. 

Example 4.111 (Continuation of Example 4.110). Suppose that I really want to test $H' : |\Theta - \theta_0| \le \delta$ versus $A' : |\Theta - \theta_0| > \delta$. Suppose that the prior distribution of $\Theta$ is $N(\theta_0, \tau^2)$. Then the posterior distribution of $\Theta$ given $X = x$ is $N(\theta_1, \tau^2/(1 + \tau^2))$, where
\[
\theta_1 = \frac{\theta_0 + x\tau^2}{1 + \tau^2}.
\]
If we use a 0-1-c loss function, the Bayes rule is to reject $H'$ if its posterior probability is less than $1/(1 + c)$. The posterior probability of $H'$ is
\[
\Phi\left(\frac{\sqrt{1+\tau^2}}{\tau}[\theta_0 + \delta - \theta_1]\right) - \Phi\left(\frac{\sqrt{1+\tau^2}}{\tau}[\theta_0 - \delta - \theta_1]\right),
\]
which is clearly a decreasing function of
\[
|\theta_0 - \theta_1| = \left|\theta_0 - \frac{\theta_0 + x\tau^2}{1 + \tau^2}\right| = \frac{\tau^2}{1 + \tau^2}|\theta_0 - x|.
\]
So, the Bayes rule is to reject $H'$ if $|x - \theta_0| > d$ for some $d$. This has the same form as the UMPU level $\alpha$ test.

Alternatively, suppose that $\Pr(\Theta = \theta_0) = p_0 > 0$. Conditional on $\Theta \ne \theta_0$, suppose that $\Theta \sim N(\theta_0, \tau^2)$. We computed the Bayes factor for this case in Example 4.18 on page 222. The Bayes factor was given in (4.19) as
\[
\sqrt{1 + \tau^2}\,\exp\left(-\frac{\tau^2 (x - \theta_0)^2}{2[1 + \tau^2]}\right).
\]
(See Problem 6 on page 286 for the entire posterior distribution of $\Theta$.) If we use a 0-1-c loss function, then the Bayes rule is to reject $H$ if the probability that it is true is less than $1/(1 + c)$. This corresponds to the Bayes factor being less than some number, which in turn is easily seen to correspond to $|x - \theta_0| > d$ for some $d$. This is in the same form as the UMPU level $\alpha$ test.

Finally, suppose that we continue to let $\Pr(\Theta = \theta_0) = p_0 > 0$, but that the conditional prior given $\Theta \ne \theta_0$ is $N(\theta', \tau^2)$ with $\theta' \ne \theta_0$. Then the same kind of calculation as above leads to the Bayes factor being small when $x$ is far from $[(1 + \tau^2)\theta_0 - \theta']/\tau^2$. Such a test is two-sided but is not UMPU of its level. In fact, the test is biased, even though it is admissible. We see once again that the class of UMPU tests is not essentially complete.
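The claim about where the Bayes factor is largest in the last part of Example 4.111 can be checked numerically. The sketch below assumes the setup stated in the example ($X \sim N(\theta, 1)$, point mass at $\theta_0$, and an $N(\theta', \tau^2)$ prior on the alternative), so that the marginal density of $X$ under the alternative is $N(\theta', 1 + \tau^2)$. The numerical values of $\theta_0$, $\theta'$, and $\tau^2$ are arbitrary illustrations.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def bayes_factor(x, theta0, theta_alt, tau2):
    """f(x | theta0) divided by the marginal density of x under the alternative.

    Under the alternative, Theta ~ N(theta_alt, tau2), so X ~ N(theta_alt, 1 + tau2).
    """
    return normal_pdf(x, theta0, 1.0) / normal_pdf(x, theta_alt, 1.0 + tau2)

theta0, theta_alt, tau2 = 0.0, 1.0, 4.0   # illustrative values
center = ((1.0 + tau2) * theta0 - theta_alt) / tau2

# Search a grid around the claimed center for the largest Bayes factor.
grid = [center + 0.01 * k for k in range(-300, 301)]
best_x = max(grid, key=lambda x: bayes_factor(x, theta0, theta_alt, tau2))

# The Bayes factor is symmetric about the center, so it is small far from it.
left = bayes_factor(center - 1.0, theta0, theta_alt, tau2)
right = bayes_factor(center + 1.0, theta0, theta_alt, tau2)
print(center, best_x, left, right)
```

With these values the center is $-0.25$ rather than $\theta_0 = 0$, which is why the resulting two-sided test is not centered at the hypothesis and is biased.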

In Example 4.111, two different types of prior distributions both led to 
Bayes rules that were of the same form as the UMPU level a test. Unfor- 
tunately, there did not appear to be any transparent connection between 
the size a and the loss function or the prior distribution. The reason for 
this is related to the inadmissibility of incoherent tests as illustrated in 
Problem 42 on page 291. We will discuss this matter more in Section 4.6. 

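Example 4.112. Suppose that $X \sim Bin(n, p)$ given $\Theta = \log(p/(1 - p))$. Then the density of $X$ with respect to counting measure on $\{0, \ldots, n\}$ is
\[
f_{X|\Theta}(x|\theta) = \binom{n}{x}\exp(x\theta)[1 + \exp(\theta)]^{-n}.
\]
Let $\Omega_H = \{\theta_0\}$ and $\Omega_A = \Omega \setminus \{\theta_0\}$. The UMPU level $\alpha$ test is
\[
\phi(x) = \begin{cases} 1 & \text{if } x < c_1 \text{ or } x > c_2, \\ \gamma_i & \text{if } x = c_i, \\ 0 & \text{if } c_1 < x < c_2, \end{cases} \tag{4.113}
\]
where $c_1 \le c_2$. Supposing that $c_1 < c_2$, we have
\[
\beta_\phi(\theta) = 1 - [1 + \exp(\theta)]^{-n}\left[(1 - \gamma_1)\binom{n}{c_1}\exp(c_1\theta) + (1 - \gamma_2)\binom{n}{c_2}\exp(c_2\theta) + \sum_{x=c_1+1}^{c_2-1}\binom{n}{x}\exp(x\theta)\right].
\]
It follows that
\[
\begin{aligned}
\frac{d}{d\theta}\beta_\phi(\theta) = {} & n\left[(1 - \gamma_1)\binom{n}{c_1}\exp(c_1\theta) + (1 - \gamma_2)\binom{n}{c_2}\exp(c_2\theta) + \sum_{x=c_1+1}^{c_2-1}\binom{n}{x}\exp(x\theta)\right]\exp(\theta)[1 + \exp(\theta)]^{-n-1} \\
& - [1 + \exp(\theta)]^{-n}\left[(1 - \gamma_1)c_1\binom{n}{c_1}\exp(c_1\theta) + (1 - \gamma_2)c_2\binom{n}{c_2}\exp(c_2\theta) + \sum_{x=c_1+1}^{c_2-1}x\binom{n}{x}\exp(x\theta)\right].
\end{aligned}
\]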



Once $c_1$ and $c_2$ are determined, solving for $\gamma_1$ and $\gamma_2$ amounts to solving two linear equations. Now, suppose that $\theta_0 = \log(0.25/0.75)$. Then, with $\alpha = 0.05$ and $n = 10$, we get (after some numerical calculation)

$c_1 = 0$, $\gamma_1 = 0.52804$,
$c_2 = 5$, $\gamma_2 = 0.00918$.

Most people who want a level 0.05 test of this hypothesis would not bother to compute the UMPU level 0.05 test but rather would perform what is called an equal-tailed test. Since Theorem 4.104 says that the two-sided tests are admissible, we could try to find a two-sided test of the form (4.113) such that the probability of rejecting for small $X$ equals the probability of rejecting for large $X$ (both equal to 0.025). In this case, the test would have

$c_1 = 0$, $\gamma_1 = 0.44394$,
$c_2 = 5$, $\gamma_2 = 0.09028$.

This test is biased because the derivative of the power function is 0.0236 at $\theta_0$. In other words, the probability of rejecting the hypothesis will be slightly less than 0.05 given $\Theta = \theta$ for a short interval of $\theta$ values below $\theta_0$.
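The "numerical calculation" in Example 4.112 can be sketched as follows. The code takes $c_1 = 0$ and $c_2 = 5$ from the text as given and solves the two linear equations (size equal to $\alpha$, and $E_{\theta_0}[X\phi(X)] = \alpha E_{\theta_0}(X)$, the zero-derivative condition) for the randomization probabilities. It assumes that $\gamma_i$ is the probability of rejecting when $x = c_i$, which is consistent with the printed values. All function names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def tail_sums(n, p0, c1, c2):
    """Sum of f(x) and of x*f(x) over the sure-rejection region x < c1 or x > c2."""
    xs = [x for x in range(n + 1) if x < c1 or x > c2]
    return (sum(binom_pmf(x, n, p0) for x in xs),
            sum(x * binom_pmf(x, n, p0) for x in xs))

def umpu_gammas(n, p0, alpha, c1, c2):
    """Solve for (gamma1, gamma2) by Cramer's rule, assuming c1 < c2.

    Size equation:        f(c1)*g1 + f(c2)*g2       = alpha - s0
    Derivative equation:  c1*f(c1)*g1 + c2*f(c2)*g2 = alpha*n*p0 - s1
    """
    s0, s1 = tail_sums(n, p0, c1, c2)
    f1, f2 = binom_pmf(c1, n, p0), binom_pmf(c2, n, p0)
    b1, b2 = alpha - s0, alpha * n * p0 - s1
    det = f1 * f2 * (c2 - c1)
    return (b1 * c2 * f2 - f2 * b2) / det, (f1 * b2 - c1 * f1 * b1) / det

def slope_at_null(n, p0, c1, c2, g1, g2):
    """Derivative of the power function at theta0: E[X phi(X)] - size * E[X]."""
    s0, s1 = tail_sums(n, p0, c1, c2)
    f1, f2 = binom_pmf(c1, n, p0), binom_pmf(c2, n, p0)
    size = s0 + g1 * f1 + g2 * f2
    return s1 + c1 * g1 * f1 + c2 * g2 * f2 - size * n * p0

n, p0, alpha, c1, c2 = 10, 0.25, 0.05, 0, 5
gamma1, gamma2 = umpu_gammas(n, p0, alpha, c1, c2)

# Equal-tailed alternative: probability alpha/2 in each tail.
lower = sum(binom_pmf(x, n, p0) for x in range(c1))
upper = sum(binom_pmf(x, n, p0) for x in range(c2 + 1, n + 1))
eq1 = (alpha / 2 - lower) / binom_pmf(c1, n, p0)
eq2 = (alpha / 2 - upper) / binom_pmf(c2, n, p0)

slope_umpu = slope_at_null(n, p0, c1, c2, gamma1, gamma2)
slope_equal = slope_at_null(n, p0, c1, c2, eq1, eq2)
print(gamma1, gamma2, eq1, eq2, slope_umpu, slope_equal)
```

The UMPU solution has zero slope at $\theta_0$ by construction, while the equal-tailed test has a positive slope, so its power dips below 0.05 just below $\theta_0$.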

One possible Bayesian solution would be to set $\Pr(P = p_0) = q_0$ and let $P \sim Beta(\alpha_0, \beta_0)$ otherwise, where $P = \exp(\Theta)/(1 + \exp(\Theta))$. Then, the Bayes factor will be
\[
\frac{p_0^x(1 - p_0)^{n-x}\,\Gamma(\alpha_0)\Gamma(\beta_0)\Gamma(\alpha_0 + \beta_0 + n)}{\Gamma(\alpha_0 + \beta_0)\Gamma(\alpha_0 + x)\Gamma(\beta_0 + n - x)}. \tag{4.114}
\]
In the special case with $\alpha_0 = \beta_0 = 1$, the Bayes factor is
\[
(n + 1)\binom{n}{x}p_0^x(1 - p_0)^{n-x}. \tag{4.115}
\]
These values have been calculated for $n = 10$ and $p_0 = 1/4$ in Table 4.116 together with the posterior probability when $q_0 = 1/2$. Note that if we used a 0-1-c loss function with $c = 19$ (so that $1/(1 + c) = 0.05$), we would still accept $H$ even when $X = 6$ was observed.
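The entries of Table 4.116 follow directly from the uniform-prior Bayes factor $(n + 1)\binom{n}{x}p_0^x(1 - p_0)^{n-x}$. The sketch below recomputes them; the function names are illustrative, and the posterior probability uses the usual relation between posterior odds, prior odds, and the Bayes factor.

```python
from math import comb

def bayes_factor_uniform(x, n, p0):
    """Bayes factor for H: P = p0 against a uniform prior (alpha0 = beta0 = 1)."""
    return (n + 1) * comb(n, x) * p0 ** x * (1.0 - p0) ** (n - x)

def posterior_prob(bf, q0):
    """Posterior probability of H from the Bayes factor and prior probability q0."""
    odds = bf * q0 / (1.0 - q0)
    return odds / (1.0 + odds)

n, p0, q0 = 10, 0.25, 0.5
table = []
for x in range(n + 1):
    bf = bayes_factor_uniform(x, n, p0)
    table.append((x, bf, posterior_prob(bf, q0)))

for x, bf, post in table:
    print(f"{x:2d}  {bf:10.5f}  {post:8.5f}")
```

In particular, the posterior probability at $x = 6$ is about 0.15, well above 0.05, which is why $H$ would still be accepted there under the 0-1-c loss with $c = 19$.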

Table 4.116. Bayes Factor and Posterior Probability in Binomial Example

 x    Bayes Factor    Posterior Prob.
 0    0.619           0.3825
 1    2.064           0.6737
 2    3.097           0.7559
 3    2.753           0.7335
 4    1.606           0.6163
 5    0.642           0.3911
 6    0.178           0.1514
 7    0.034           0.0329
 8    0.004           0.0042
 9    3 x 10^-4       0.0003
10    1 x 10^-5       1 x 10^-5

As we noted in Section 4.2.2, we would run into trouble if we naively tried to use an improper prior (with $\alpha_0 = \beta_0 = 0$) for the alternative. The Bayes factor in (4.114) would become $\infty$ in this case. On the other hand, if we let $q_0$ go to zero at a rate such that $q_0\Gamma(\alpha_0)\Gamma(\beta_0)/\Gamma(\alpha_0 + \beta_0)$ converges to a constant $k$, then the product of the prior odds ratio $q_0/(1 - q_0)$ and the Bayes factor would converge to
\[
k(n - 1)\binom{n - 2}{x - 1}p_0^x(1 - p_0)^{n-x}
\]
if both $x > 0$ and $n - x > 0$. This has a form similar to (4.115).

Another Bayesian solution would be to replace $H$ and $A$ by $H' : |\Theta - \theta_0| \le \delta$ and $A' : |\Theta - \theta_0| > \delta$. Suppose that $P \sim Beta(\alpha_0, \beta_0)$. The posterior distribution of $P$ given $X = x$ is $Beta(\alpha_0 + x, \beta_0 + n - x)$. It is an easy matter to calculate $\Pr(H' \text{ true} | X = x)$ for various values of $\delta$ and $x$. Figure 4.119 gives plots of $\Pr(|P - 1/4| \le \delta | X = x)$ for $\alpha_0 = \beta_0 = 1$ for all values of $x = 0, \ldots, 10$ when $n = 10$. For example, suppose that $\delta = 0.1$. We see that for $x = 0, \ldots, 5$, the posterior probability of the hypothesis is greater than 0.05. So, if $c = 19$ and we use the 0-1-c loss function, we would accept $H'$ if $X \le 5$ and would reject otherwise.

Notice that the condition that the derivative of the power function be 0 in Example 4.112 was equivalent to
\[
\alpha E_{\theta_0}X = E_{\theta_0}(X\phi(X)) \tag{4.117}
\]
if the size is $\alpha$. This is true in general in exponential families.

Proposition 4.118.$^{34}$ If $X$ has a one-parameter exponential family distribution with natural parameter $\Theta$ and $\phi$ is a test of $H : \Theta = \theta_0$ with size $\alpha$, then $\beta_\phi$ has 0 derivative at $\theta_0$ if and only if (4.117) holds.


34 This proposition is used in the proof of Theorem 4.124. 







Figure 4.119. $\Pr(|P - \frac{1}{4}| \le \delta \mid X = x)$ for all $\delta \le \frac{1}{4}$ and all $x$ (horizontal axis: $\delta$)
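The posterior probabilities plotted in Figure 4.119 can be recomputed without special libraries when the prior is uniform, because the posterior $Beta(x + 1, n - x + 1)$ has integer parameters. The sketch below relies on the standard identity that the $Beta(a, b)$ CDF at $p$, for positive integers $a$ and $b$, equals $\Pr(Bin(a + b - 1, p) \ge a)$. The function names and the choice $\delta = 0.1$ (taken from the worked case in the text) are illustrative.

```python
from math import comb

def beta_cdf_int(p, a, b):
    """CDF at p of Beta(a, b) for positive integers a and b.

    Uses the identity I_p(a, b) = Pr(Bin(a + b - 1, p) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * p ** j * (1.0 - p) ** (m - j) for j in range(a, m + 1))

def posterior_prob_interval(x, n, center, delta):
    """Pr(|P - center| <= delta | X = x) under a uniform prior on P."""
    a, b = x + 1, n - x + 1
    lo = max(center - delta, 0.0)
    hi = min(center + delta, 1.0)
    return beta_cdf_int(hi, a, b) - beta_cdf_int(lo, a, b)

n, center, delta = 10, 0.25, 0.1
probs = [posterior_prob_interval(x, n, center, delta) for x in range(n + 1)]
accepted = [x for x, pr in enumerate(probs) if pr >= 0.05]
print(probs)
print(accepted)
```

With $\delta = 0.1$ the posterior probability stays above 0.05 exactly for $x = 0, \ldots, 5$, matching the acceptance region described for the 0-1-c loss with $c = 19$. The value at $x = 6$ falls only slightly below 0.05.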

When UMPU level $\alpha$ tests do not exist, one can try to find LMPU (locally most powerful unbiased) level $\alpha$ tests.$^{35}$ When power functions are continuously differentiable and $\phi$ is the unique unbiased level $\alpha$ test of $H : \Theta = \theta_0$ with maximum second derivative for the power function, then $\phi$ is LMPU level $\alpha$ relative to $d(\theta) = |\theta - \theta_0|$. (See Problem 50 on page 292.)

4.5 Nuisance Parameters 

When the parameter is multidimensional and $\Omega_H$ is a smaller-dimensional space, the remaining dimensions are often called nuisance parameters for reasons that will become apparent shortly. In a Bayesian analysis, one must integrate nuisance parameters out of the posterior joint distribution of the parameters and base inference on the marginal distribution of the parameters of interest. This can be a nuisance also.

4.5.1 Neyman Structure 

The approach that we will take to finding UMPU tests in the presence 
of nuisance parameters is to find a statistic T such that the conditional 
distribution of the data given T has a one-dimensional parameter. In many 
cases, it will then turn out that the UMPU test among all tests that have 



35 We leave it to the interested reader to write a formal definition of LMPU.






level a conditional on T will also be UMPU unconditionally. If a test is 
a-similar conditional on T, we say that it has Neyman structure. 

Definition 4.120. Let $G \subseteq \Omega$ be a subparameter space corresponding to a subfamily $\mathcal{Q}_0$ of $\mathcal{P}_0$, and let $\Psi : \mathcal{Q}_0 \to G$ be the subparameter. If $T$ is a sufficient statistic for $\Psi$ in the classical sense, then a test $\phi$ has Neyman structure relative to $G$ and $T$ if $E_\theta[\phi(X)|T = t]$ is constant in $t$ a.s. $[P_\theta]$ for all $\theta \in G$.

It is easy to see that if $\mathcal{Q}_0 = \{P_\theta : \theta \in \bar{\Omega}_H \cap \bar{\Omega}_A\}$ and if $\phi$ has Neyman structure, then $\phi$ is $\alpha$-similar. We will prove later (Lemma 4.122) that under certain conditions all $\alpha$-similar tests have Neyman structure. In these cases, one can find UMP $\alpha$-similar (hence UMPU level $\alpha$ by Proposition 4.92) tests by restricting attention to tests with Neyman structure. Consider the following example.

Example 4.121. Suppose that $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$ random variables conditional on $(M, \Sigma) = (\mu, \sigma)$. The usual two-sided t-test of $H : M = \mu_0$ versus $A : M \ne \mu_0$ is
\[
\phi(x) = \begin{cases} 1 & \text{if } \dfrac{|\bar{x} - \mu_0|}{s/\sqrt{n}} > T_{n-1}^{-1}\left(1 - \dfrac{\alpha}{2}\right), \\ 0 & \text{otherwise,} \end{cases}
\]
where $T_{n-1}$ is the CDF of the $t_{n-1}(0, 1)$ distribution and $s^2 = \sum_{i=1}^n (x_i - \bar{x})^2/(n - 1)$. Here, the intersection of $\bar{\Omega}_H$ with $\bar{\Omega}_A$ is $\Omega_H = \{(\mu, \sigma) : \mu = \mu_0\}$. It is easy to see that $\phi$ is $\alpha$-similar, as follows. Given $(M, \Sigma) = (\mu_0, \sigma) \in \Omega_H$, the conditional distribution of $T = (\bar{X} - \mu_0)/(S/\sqrt{n})$ is $t_{n-1}(0, 1)$ for all $\sigma$. Hence
\[
\beta_\phi(\mu_0, \sigma) = \alpha \quad \text{for all } \sigma.
\]
Let the subparameter space be $\Omega_H$ itself. A sufficient statistic for the subparameter $\Sigma$ is
\[
U = \sum_{i=1}^n (X_i - \mu_0)^2 = (n - 1)S^2 + n(\bar{X} - \mu_0)^2.
\]
We can write
\[
W = \frac{\bar{X} - \mu_0}{\sqrt{U}} = \frac{\mathrm{sign}(T)}{\sqrt{n\left[\frac{n-1}{T^2} + 1\right]}},
\]
so that $W$ is a one-to-one increasing function of $T$ and $\phi(X)$ is a function of $W$. We need to show that the conditional distribution of $W$ given $(M, \Sigma) = \gamma$ and $U = u$ is the same for all $\gamma \in \Omega_H$ and all $u$. If this is true, then $W$ is independent of $U$ given $(M, \Sigma) = \gamma \in \Omega_H$ and
\[
E_\gamma[\phi(X)|U = u] = E_\gamma[\phi(X)] = \alpha,
\]
for all $\gamma \in \Omega_H$ and $\phi$ would have Neyman structure relative to $\Omega_H$ and $U$.

Since the distribution of $(X_1 - \mu_0, \ldots, X_n - \mu_0)$ is spherically symmetric, we showed in Examples B.56 and B.60 (see pages 627 and 628) that the conditional distribution of
\[
\left(\frac{X_1 - \mu_0}{\sqrt{U}}, \ldots, \frac{X_n - \mu_0}{\sqrt{U}}\right)
\]
given $U$ is uniform on the sphere of radius 1 and is independent of $U$. Hence $W$ is independent of $U$ given $(M, \Sigma) = \gamma \in \Omega_H$.
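The two algebraic identities used in Example 4.121, the decomposition of $U$ and the expression of $W$ as a function of $T$, can be spot-checked on simulated data. This is only a numerical illustration; the sample size, seed, and parameter values below are arbitrary.

```python
import random
from math import copysign, sqrt

def summary_statistics(xs, mu0):
    """Return (xbar, s2, t, u, w) for the sample xs and null value mu0."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    t = (xbar - mu0) / sqrt(s2 / n)          # usual t statistic
    u = sum((x - mu0) ** 2 for x in xs)      # sufficient statistic under H
    w = (xbar - mu0) / sqrt(u)
    return xbar, s2, t, u, w

random.seed(0)
mu0 = 0.5
xs = [random.gauss(1.0, 2.0) for _ in range(8)]
n = len(xs)
xbar, s2, t, u, w = summary_statistics(xs, mu0)

# U = (n - 1) S^2 + n (Xbar - mu0)^2
u_decomposed = (n - 1) * s2 + n * (xbar - mu0) ** 2

# W = sign(T) / sqrt(n [(n - 1)/T^2 + 1])
w_from_t = copysign(1.0, t) / sqrt(n * ((n - 1) / t ** 2 + 1.0))
print(u, u_decomposed, w, w_from_t)
```

Because $W$ is an increasing function of $T$ alone, rejecting for large $|T|$ is the same as rejecting for large $|W|$.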

Lemma 4.122.$^{36}$ If $T$ is a boundedly complete sufficient statistic for the subparameter space $G \subseteq \Omega$, then every $\alpha$-similar test on $G$ has Neyman structure relative to $G$ and $T$.

Proof. By $\alpha$-similarity, $E_\theta\{E_\theta(\phi(X)|T) - \alpha\} = 0$, for all $\theta \in G$. Since $T$ is boundedly complete, it must be that $E_\theta[\phi(X)|T] = \alpha$, a.s. $[P_\theta]$ for all $\theta \in G$. □

We can now combine this result with Proposition 4.92 to conclude a useful result for identifying cases in which UMPU tests exist.

Theorem 4.123. Let $G = \bar{\Omega}_H \cap \bar{\Omega}_A$. Let $I$ be an index set, and suppose that $G = \cup_{i \in I} G_i$, where the subsets $G_i$ form a partition of $G$. Suppose that there exists a statistic $T$ that is a boundedly complete sufficient statistic for each subparameter space $G_i$, $i \in I$. Also, assume that the power function of every test is continuous. If there is a UMPU level $\alpha$ test $\phi$ among those which have Neyman structure relative to $G_i$ and $T$ for all $i \in I$, then $\phi$ will be UMPU level $\alpha$.

Proof. Because the power functions are continuous, Proposition 4.92 says that all unbiased level $\alpha$ tests are $\alpha$-similar. Lemma 4.122 says that because there is a boundedly complete sufficient statistic $T$ for each subparameter space $G_i$, every $\alpha$-similar test has Neyman structure relative to $G_i$ and $T$. The result now follows. □
The way that Theorem 4.123 is generally used is the following. We suppose that power functions are continuous and that there exists a partition of $G = \bar{\Omega}_H \cap \bar{\Omega}_A$ into one or two sets ($G = G_0$ or $G = G_1 \cup G_2$) and a statistic $T$ that is boundedly complete sufficient for each $G_i$. We also suppose that the conditional distribution of $X$ given $T$ is a one-parameter family with parameter $g(\Theta)$ for some function $g$. We also suppose that $\Omega_H$ can be written as $\{\theta : g(\theta) \le b_0\}$ or $\{\theta : g(\theta) \in [b_0, b_1]\}$ or one of the other forms with which we are already familiar. So, for example, if $\Theta = (M, \Sigma)$ and $\Omega_H = \{(\mu, \sigma) : b_0 \le \mu \le b_1\}$, then $g(\mu, \sigma) = \mu$, $G_1 = \{(\mu, \sigma) : \mu = b_0\}$, and $G_2 = \{(\mu, \sigma) : \mu = b_1\}$. For the one-sided cases, we assume that the family of distributions of $X$ given $T$ has MLR, but for the other cases, the conditional distribution of $X$ given $T$ should be a one-parameter exponential family with natural parameter $g(\Theta)$. We then find the UMP or UMPU level $\alpha$ test of $H$ conditional on $T$. For all of the cases, except for the case in which the hypothesis is $H : g(\Theta) = b_0$, the UMP or UMPU level $\alpha$ test conditional on $T$ will also be UMP or UMPU among tests with Neyman structure. The reason is that these tests are derived as UMP or UMPU among all tests that satisfy conditions that are equivalent to having Neyman structure. For example, when $H : g(\Theta) \in [b_0, b_1]$, the two-sided test is conditionally UMPU level $\alpha$ among all tests that have level $\alpha$ and have conditional power function equal to $\alpha$ at $g(\theta) = b_0$ and $g(\theta) = b_1$. This is exactly what it means to have Neyman structure relative to $G_1 = \{\theta : g(\theta) = b_0\}$ and relative to $G_2 = \{\theta : g(\theta) = b_1\}$. For the case of $H : g(\Theta) = b_0$, the two-sided test is conditionally UMPU level $\alpha$ among those tests with conditional power function $\alpha$ at $g(\theta) = b_0$ and with derivative of the conditional power function equal to 0 at $g(\theta) = b_0$. This last condition is not part of the definition of Neyman structure. Hence, in these cases, we need to prove that every Neyman structure test also satisfies this last condition. Problem 58 on page 293 is an example of this situation. In multiparameter exponential families, when $g(\Theta)$ is just one of the coordinates of $\Theta$, every $\alpha$-similar level $\alpha$ test will have zero derivative for the conditional power function, as we prove in Theorem 4.124.

36 This lemma is used in the proofs of Theorems 4.123 and 4.124.

4.5.2 Tests about Natural Parameters 

The case in which we can prove the most complete result is that of an 
exponential family in which the hypothesis concerns one of the parameters. 

Theorem 4.124. Let $(X_1, \ldots, X_k)$ have a $k$-parameter exponential family distribution with natural parameter $\Theta = (\Theta_1, \ldots, \Theta_k)$, and let $U = (X_2, \ldots, X_k)$.

1. Suppose that the hypothesis is one-sided or two-sided concerning only $\Theta_1$. Then there is a UMP level $\alpha$ test conditional on $U$, and it is UMPU level $\alpha$.

2. If the hypothesis concerns only $\Theta_1$ and the alternative is two-sided, then there is a UMPU level $\alpha$ test conditional on $U$, and it is UMPU level $\alpha$.

Proof. Suppose that the density of $X$ with respect to a measure $\nu$ is written
\[
f_{X|\Theta}(x|\theta) = c(\theta)h(x)\exp\left(\sum_{i=1}^k \theta_i x_i\right).
\]
Let $G = \bar{\Omega}_H \cap \bar{\Omega}_A$, the intersection of the closures of the hypothesis and alternative sets. The conditional density of $X_1$ given $(X_2, \ldots, X_k) = (x_2, \ldots, x_k)$ with respect to the measure $d\nu_{X_1|U}(x_1|u)$ (from Theorem B.46 with $\mathcal{X} = \mathbb{R}$ and $\mathcal{U} = \mathbb{R}^{k-1}$) is
\[
f_{X_1|U,\Theta}(x_1|u, \theta) = \frac{\exp(\theta_1 x_1)h(x)}{\int h(x)\exp(\theta_1 x_1)\, d\nu_{X_1|U}(x_1|u)},
\]
which can be seen to depend on $\theta$ only through $\theta_1$. So, for each vector $u$, the conditional distribution of $X_1$ is a one-parameter exponential family with natural parameter $\theta_1$. For the hypotheses considered in this theorem, the subparameter space $G$ is either the set $G_0 = \{\theta : \theta_1 = \theta_1^0\}$ for some $\theta_1^0$ or the union of two such sets, $G_1 = \{\theta : \theta_1 = \theta_1^1\}$ and $G_2 = \{\theta : \theta_1 = \theta_1^2\}$. For each such subset of $\Omega$, the subparameter $\Psi = (\Theta_2, \ldots, \Theta_k)$ has complete sufficient statistic $U = (X_2, \ldots, X_k)$. Let $\eta$ be an unbiased level $\alpha$ test. It follows from Proposition 4.92 that $\eta$ is $\alpha$-similar on $G_0$, or on $G_1$ and on $G_2$, whichever is appropriate. By Lemma 4.122, $\eta$ has Neyman structure. Also, for every test $\eta$, $\beta_\eta(\theta) = E_\theta(E_\theta[\eta(X)|U])$, so that a test that maximizes the conditional power function uniformly for $\theta \in \Omega_A$ subject to constraints also maximizes the marginal power function subject to the same constraints.

For part (1), in the conditional problem given $U = u$, there is a level $\alpha$ test $\phi$ that maximizes the conditional power function uniformly on $\Omega_A$ subject to having Neyman structure (see Theorem 4.56, Propositions 4.62-4.65, and Theorem 4.82). Since every unbiased level $\alpha$ test has Neyman structure, and the power function is the expectation of the conditional power function, $\phi$ is UMPU level $\alpha$.

For part (2), we consider two cases. First, suppose that $\Omega_H = \{\theta : c_1 \le \theta_1 \le c_2\}$ with $c_2 > c_1$. Then, Lemma 4.99 shows that there is a test $\phi$ whose conditional power function is maximized uniformly on $\Omega_A$ subject to having Neyman structure. It follows as before that $\phi$ is UMPU level $\alpha$. Finally, suppose that $\Omega_H = \{\theta : \theta_1 = \theta_1^0\}$. If $\eta$ is unbiased level $\alpha$, then $\beta_\eta$ must have zero partial derivative with respect to $\theta_1$ evaluated at every point in $G$. Using Theorem 2.64, just as in the proof of Theorem 4.104, we get, for every $\theta^* \in G$,
\[
\left.\frac{\partial}{\partial\theta_1}\beta_\eta(\theta)\right|_{\theta=\theta^*} = E_{\theta^*}[X_1\eta(X) - \alpha X_1].
\]
By the law of total probability B.70, $E_{\theta^*}(E_{\theta^*}[X_1\eta(X) - \alpha X_1|U]) = 0$, for every $\theta^* \in G$. Since $U$ is complete for $\Theta \in G$, it follows that
\[
E_{\theta^*}[X_1\eta(X) - \alpha X_1|U] = 0, \quad \text{a.s. } [P_{\theta^*}]. \tag{4.125}
\]
So every unbiased level $\alpha$ test $\eta$ must satisfy (4.125). Proposition 4.118 says that (4.125) is equivalent to the derivative of the conditional power function with respect to $\theta_1$ at $\theta_1 = \theta_1^0$ being zero. Subject to this condition and having Neyman structure, there is a test that maximizes the conditional power function at all $\theta \in \Omega_A$ according to Corollary 4.109. This test is then UMPU level $\alpha$. □

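Example 4.126. Suppose that $(X_1, U)$ has a multiparameter exponential family and $\Omega_H = \{\theta : \theta_1 \le \theta_1^0\}$ with $\Omega_A = \{\theta : \theta_1 > \theta_1^0\}$. The conditional UMP level $\alpha$ test is
\[
\phi(x_1|u) = \begin{cases} 1 & \text{if } x_1 > d(u), \\ \gamma(u) & \text{if } x_1 = d(u), \\ 0 & \text{if } x_1 < d(u), \end{cases}
\]
where $d$ and $\gamma$ are chosen so that the conditional size is $\alpha$ a.s. This test is UMPU level $\alpha$ by Theorem 4.124.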






Example 4.127 (Continuation of Example 4.121; see page 266). The usual two-sided t-test is an example of Theorem 4.124. To see this, write the joint density of $(\bar{X}, S^2)$ as
\[
\begin{aligned}
f_{\bar{X},S^2|M,\Sigma}(\bar{x}, s^2|\mu, \sigma) = {} & \frac{\sqrt{n}\left(\frac{n-1}{2}\right)^{\frac{n-1}{2}}(s^2)^{\frac{n-3}{2}}}{\sqrt{2\pi}\,\Gamma\left(\frac{n-1}{2}\right)\sigma^n} \\
& \times \exp\left(-\frac{1}{2\sigma^2}\left[n(\bar{x} - \mu_0 - [\mu - \mu_0])^2 + (n - 1)s^2\right]\right) \\
= {} & r(\mu, \sigma)h(\bar{x}, s^2)\exp(\theta_1 v + \theta_2 u),
\end{aligned}
\]
where $r(\mu, \sigma)$ depends solely on the parameters, the natural parameters are $\theta_1 = (\mu - \mu_0)/\sigma^2$ and $\theta_2 = -1/(2\sigma^2)$, and
\[
v = n(\bar{x} - \mu_0), \qquad u = n(\bar{x} - \mu_0)^2 + (n - 1)s^2.
\]
(Note that $u$ is the observed value of the statistic $U$ from Example 4.121.) Theorem 4.124 says that the UMPU level $\alpha$ test of $H : \Theta_1 = 0$ versus $A : \Theta_1 \ne 0$ is the conditional UMPU level $\alpha$ test of $H$ versus $A$ given $U$. Note that $\Theta_1 = 0$ is equivalent to $M = \mu_0$. Since $V$ has a one-parameter exponential family distribution with natural parameter $\Theta_1$ given $U$, the test will be a two-sided test of the form
\[
\phi(v|u) = \begin{cases} 1 & \text{if } v < d_1(u) \text{ or if } v > d_2(u), \\ 0 & \text{if } d_1(u) \le v \le d_2(u), \end{cases}
\]
where $d_1(u)$ and $d_2(u)$ are chosen so that the conditional power function equals $\alpha$ at $\theta_1 = 0$ for all $u$ and so that the derivative of the conditional power function equals 0 at $\theta_1 = 0$ for all $u$. As we saw in Example 4.121, the two-sided t-test has the above form with $d_1(u) = -c\sqrt{u}$ and $d_2(u) = c\sqrt{u}$, where $c > 0$ is a constant. We also saw that the t-test has conditional level $\alpha$ given $U$. The fact that the derivative of the conditional power function is 0 follows, as in the proof of Theorem 4.124, from the fact that the partial derivative of the marginal power function is 0. Hence, the usual two-sided t-test is UMPU level $\alpha$.

One possible Bayesian approach is to put positive probability on $\Omega_H$. This was done in Example 4.22 on page 224, where Bayes factors were computed. It is not the case that the usual two-sided t-test is equivalent to rejecting $H$ when the Bayes factor is less than a constant. However, with a special type of improper prior, the two tests are equivalent. In Example 4.22, we showed that if $\lambda_0 \to 0$ and $p_0 \to 0$ in such a way that the ratio $p_0/\sqrt{\lambda_0} \to k$, a constant, then the posterior odds in favor of the hypothesis converge to



where t is the usual t statistic. 

Example 4.128. Suppose that $X$ and $Y$ are conditionally independent given $\Gamma = (\Lambda, M) = (\lambda, \mu)$ with $X \sim Poi(\lambda)$ and $Y \sim Poi(\mu)$. We wish to test $H : \Lambda = 2M$ versus $A : \Lambda \ne 2M$. Set $G = \{(\lambda, \mu) : \lambda = 2\mu\}$. Set $\Psi = \Lambda = 2M$ for the subparameter space $G$, and note that
\[
f_{X,Y|\Psi}(x, y|\psi) = \exp(-1.5\psi)\frac{\psi^{x+y}}{2^y\, x!\, y!},
\]
for $x, y = 0, 1, \ldots$. It follows that $T = X + Y$ is a complete sufficient statistic for the subparameter space. The distribution of $T$ is
\[
f_{T|\Gamma}(t|\lambda, \mu) = \exp(-\lambda - \mu)\mu^t\sum_{a=0}^t \frac{(\lambda/\mu)^a}{a!(t - a)!},
\]
for $t = 0, 1, \ldots$. The conditional distribution of $(X, Y)$ given $T$ and $\Gamma$ is one-dimensional and can be represented by $X$ alone. It is
\[
f_{X|T,\Gamma}(x|t, \lambda, \mu) = \frac{(\lambda/\mu)^x/[x!(t - x)!]}{\sum_{a=0}^t (\lambda/\mu)^a/[a!(t - a)!]},
\]
for $x = 0, \ldots, t$. This is easily seen to be a one-parameter exponential family with natural parameter $\Theta = g(\Gamma) = \log(\Lambda/M)$. The hypothesis can be written as $H : \Theta = \log(2)$ versus $A : \Theta \ne \log(2)$. All we need to show is that every $\alpha$-similar test with level $\alpha$ has zero derivative for the conditional power function at $\theta = \log(2)$. To do this, first we reparameterize to $\Theta$ and $M$. Then
\[
f_{X,Y|\Theta,M}(x, y|\theta, \mu) = \exp(-\mu[\exp(\theta) + 1])\frac{\exp(x\theta)\mu^{x+y}}{x!\, y!}.
\]
Every unbiased $\alpha$-similar level $\alpha$ test must have partial derivative of the power function with respect to $\theta$ equal to 0 at every point $(\log(2), \mu)$ in $G$, otherwise the power function would dip below $\alpha$ on the alternative. The partial derivative of the power function of a test $\phi$ with respect to $\theta$ is
\[
\begin{aligned}
& \sum_{x=0}^\infty\sum_{y=0}^\infty \phi(x, y)\exp(-\mu[\exp(\theta) + 1])\frac{\exp(x\theta)\mu^{x+y}}{x!\, y!}[x - \mu\exp(\theta)] \\
& \quad = E_{\theta,\mu}(X\phi(X, Y)) - \mu\exp(\theta)\beta_\phi(\theta, \mu).
\end{aligned}
\]
Now, plug in $\theta = \log(2)$ and set this equal to 0. Note that $2\mu$ is the mean of $X$ and the power function at $(\log(2), \mu)$ is $\alpha$ for every $\mu$ for an $\alpha$-similar test. Hence, every $\alpha$-similar level $\alpha$ test $\phi$ must satisfy
\[
0 = E_{\log(2),\mu}(X\phi(X, Y) - X\alpha) = E_{\log(2),\mu}(E(X\phi(X, Y) - X\alpha|T)),
\]
for all $\mu$. Let $h(t) = E(X\phi(X, Y) - X\alpha|T = t)$. Since $T$ is complete for the subparameter space, $E_{\log(2),\mu}(h(T)) = 0$ for all $\mu$ implies $h(T) = 0$, a.s. $[P_{\log(2),\mu}]$ for all $\mu$. By Proposition 4.118, this is equivalent to the derivative of the conditional power function being 0 at $\theta = \log(2)$.

None of the theorems of this section provides an essentially complete class of tests. To find an essentially complete class, we could piece together conditional tests given $U = (X_2, \ldots, X_k)$. For example, let $\psi$ be a test of $H : \Theta_1 \le \theta_1^0$ versus $A : \Theta_1 > \theta_1^0$. Let $\beta_\psi(\theta_1|u)$ be the conditional power function of $\psi$ given $U = u$. For each $u$, there is a one-sided test of the form
\[
\phi(x) = \begin{cases} 1 & \text{if } x_1 > c(u), \\ \gamma(u) & \text{if } x_1 = c(u), \\ 0 & \text{if } x_1 < c(u), \end{cases} \tag{4.129}
\]
which has maximum conditional power function for $\theta_1 > \theta_1^0$ and minimum conditional power function for $\theta_1 < \theta_1^0$ subject to the power being $\beta_\psi(\theta_1^0|u)$ at $\theta_1 = \theta_1^0$. Since $\phi$ in (4.129) minimizes and maximizes the power in precisely the right places uniformly in $u$ among all tests in a class that contains $\psi$, it follows that $R(\theta, \phi) \le R(\theta, \psi)$ for all $\theta$ if a hypothesis-testing loss function is being used. It follows that tests of the form (4.129) form an essentially complete class. Other hypotheses can be handled in a similar way.

4.5.3 Linear Combinations of Natural Parameters 

Most of the popular tests in the theory of normal distributions and linear
models involve linear combinations of the natural parameters of an exponential
family. In a $k$-parameter exponential family with natural parameter
$\Theta$, let $\Psi_1 = \sum_{i=1}^k c_i\Theta_i$ with $c_1 \ne 0$. Let $\Psi_i = \Theta_i$ for $i = 2,\ldots,k$, and set
$Y_1 = X_1/c_1$ and $Y_i = X_i - c_iX_1/c_1$ for $i = 2,\ldots,k$. Then $Y = (Y_1,\ldots,Y_k)$
has exponential family distribution with natural parameter $\Psi$. If we want
to test a hypothesis concerning $\Psi_1$, we can proceed as above.
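The reparameterization works because it leaves the exponent of the exponential family unchanged: $\sum_i \theta_i x_i = \sum_i \psi_i y_i$. The following minimal sketch checks this identity numerically. The function names and the numerical values are illustrative assumptions, not part of the text.

```python
import random

def reparameterize(theta, x, c):
    """Return (psi, y) as defined in the text; list index 0 plays the role of subscript 1."""
    k = len(theta)
    psi = [sum(c[i] * theta[i] for i in range(k))] + list(theta[1:])
    y = [x[0] / c[0]] + [x[i] - c[i] * x[0] / c[0] for i in range(1, k)]
    return psi, y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)                      # arbitrary illustrative values
theta = [random.uniform(-1, 1) for _ in range(4)]
x = [random.uniform(-1, 1) for _ in range(4)]
c = [2.0, -1.0, 0.5, 3.0]           # c[0] must be nonzero
psi, y = reparameterize(theta, x, c)
print(dot(theta, x), dot(psi, y))   # the two inner products agree up to rounding
```

Since the inner product is preserved for every $\theta$ and $x$, the density of $Y$ has the same exponential family form with $\Psi$ in place of $\Theta$.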

Example 4.130 (Continuation of Example 4.127; see page 270). The natural
parameters are $\Theta_1 = M/\Sigma^2$ and $\Theta_2 = -1/[2\Sigma^2]$. Testing $M = \mu_0$ is then equivalent
to testing $\Theta_1 + 2\mu_0\Theta_2 = 0$. So, we set $\Psi_1 = \Theta_1 + 2\mu_0\Theta_2$ and the usual
two-sided t-test will result exactly as in Example 4.127.

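Example 4.131. Suppose that $X$ and $Y$ are conditionally independent given
$(\Lambda, M) = (\lambda,\mu)$ with $X \sim Poi(\lambda)$ and $Y \sim Poi(\mu)$. Suppose that $H : \Lambda = M$.

In natural parameter form, $\Theta_1 = \log\Lambda$ and $\Theta_2 = \log M$. So $H : \Theta_1 - \Theta_2 = 0$.
Set $\Psi_1 = \Theta_1 - \Theta_2$ and $\Psi_2 = \Theta_2$. Let $Z_1 = X$ and $Z_2 = X + Y$. Then
$(Z_1,Z_2)$ has exponential family distribution with natural parameter $\Psi$. We need
the conditional distribution of $Z_1$ given $Z_2$. For $z_1 = 0,\ldots,z_2$ and $z_2 = 0,1,\ldots$,
we have

\[
f_{Z_1,Z_2|\Psi}(z_1,z_2|\psi) = c(\psi)\binom{z_2}{z_1}\frac{1}{z_2!}\exp(z_1\psi_1 + z_2\psi_2),
\]
\[
f_{Z_2|\Psi}(z_2|\psi) = c(\psi)\frac{1}{z_2!}[1+\exp(\psi_1)]^{z_2}\exp(z_2\psi_2),
\]
\[
f_{Z_1|Z_2,\Psi}(z_1|z_2,\psi) = r(\psi_1,z_2)\binom{z_2}{z_1}\exp(z_1\psi_1)
= \binom{z_2}{z_1}\left(\frac{\lambda}{\lambda+\mu}\right)^{z_1}\left(\frac{\mu}{\lambda+\mu}\right)^{z_2-z_1},
\]

where $r(\psi_1,z_2) = [1+\exp(\psi_1)]^{-z_2}$, for $z_1 = 0,\ldots,z_2$. So $Z_1$ given $Z_2 = z_2$ and
$(\Theta_1,\Theta_2) = (\log\lambda,\log\mu)$ has $Bin(z_2,\lambda/(\lambda+\mu))$ distribution. The UMPU level $\alpha$
test for the binomial parameter equal to 1/2, conditional on $Z_2$, is the UMPU
level $\alpha$ test of $\Lambda = M$.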
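The conditional binomial distribution in Example 4.131 can be checked directly from the Poisson mass functions. The sketch below uses the standard fact that the sum of independent Poisson variables is Poisson; the rates and the value of $z_2$ are arbitrary illustrative choices, and the function names are not from the text.

```python
from math import comb, exp, factorial

def poisson_pmf(k, rate):
    return exp(-rate) * rate**k / factorial(k)

def conditional_pmf(z1, z2, lam, mu):
    """Pr(X = z1 | X + Y = z2) for independent X ~ Poi(lam), Y ~ Poi(mu)."""
    joint = poisson_pmf(z1, lam) * poisson_pmf(z2 - z1, mu)
    marginal = poisson_pmf(z2, lam + mu)   # X + Y ~ Poi(lam + mu)
    return joint / marginal

def binomial_pmf(z1, z2, p):
    return comb(z2, z1) * p**z1 * (1 - p)**(z2 - z1)

lam, mu, z2 = 1.7, 0.6, 5   # illustrative values only
for z1 in range(z2 + 1):
    print(z1, conditional_pmf(z1, z2, lam, mu), binomial_pmf(z1, z2, lam / (lam + mu)))
```

Under $H : \Lambda = M$ the success probability is 1/2 whatever the common rate is, which is why conditioning on $Z_2$ removes the nuisance parameter.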

4.5.4 Other Two-Sided Cases* 

In Theorem 4.124, we saw how to deal with cases in which the hypothesis or 
alternative is two-sided in a natural parameter. But other two-sided cases 



*This section may be skipped without interrupting the flow of ideas.



4.5. Nuisance Parameters 273 



arise. 

Example 4.132. Suppose that $X_1,\ldots,X_n$ are conditionally IID $N(\mu,\sigma^2)$ given
$(M,\Sigma) = (\mu,\sigma)$. This is an exponential family with natural parameters $\Theta_1 =
M/\Sigma^2$ and $\Theta_2 = -1/[2\Sigma^2]$. Suppose that we wish to test $H : a \le M \le b$ versus
$A$ : not $H$. We can rewrite $H$ in terms of the natural parameters as

\[
H : \Theta_1 + 2b\Theta_2 \le 0 \text{ and } \Theta_1 + 2a\Theta_2 \ge 0.
\]

We can reparameterize to $\Psi_1 = \Theta_1 + 2b\Theta_2$ and $\Psi_2 = \Theta_1 + 2a\Theta_2$, and the
hypothesis becomes $H : \Psi_1 \le 0$ and $\Psi_2 \ge 0$. (If $a < b$, the parameter space is
$\{(\psi_1,\psi_2) : \psi_1 < \psi_2\}$.) This is not of any of the forms we have studied so far.

Suppose that $\phi$ is an unbiased level $\alpha$ test of $H$ versus $A$. Then $\phi$ must be
$\alpha$-similar. This means that $\beta_\phi(0,\psi_2) = \beta_\phi(\psi_1,0) = \alpha$ for all $\psi_1 < \psi_2$. It is not
easy to construct nontrivial tests with these properties. (Of course, the trivial
test $\phi(x) = \alpha$ for all $x$ has these properties and is unbiased.)

We can address this problem simply within the Bayesian framework. To stay as
close to the classical solutions as possible, suppose that we use the usual improper
prior, so that the posterior distribution of $M$ is $M \sim t_{n-1}(\bar x_n, s_n^2/n)$, where $\bar x_n =
\sum_{i=1}^n x_i/n$ and $s_n^2 = \sum_{i=1}^n (x_i - \bar x_n)^2/(n-1)$. The posterior probability that $H$
is true is

\[
p = T_{n-1}\left(\sqrt{n}\,\frac{b-\bar x_n}{s_n}\right) - T_{n-1}\left(\sqrt{n}\,\frac{a-\bar x_n}{s_n}\right),
\]

where $T_{n-1}$ is the CDF of the $t$ distribution with $n-1$ degrees of freedom.
The formal Bayes rule with a 0-1-c loss would be to reject $H$ if $p < 1/(1+c)$.
To see what this test looks like, note that for each value of $s_n$, $p$ is a decreasing
function of $|t|$, where

\[
t = \sqrt{n}\,\frac{\bar x_n - \frac{a+b}{2}}{s_n}. \tag{4.133}
\]

In fact,

\[
p = T_{n-1}\left(\frac{\sqrt{n}(b-a)}{2s_n} - t\right) - T_{n-1}\left(-\frac{\sqrt{n}(b-a)}{2s_n} - t\right).
\]

So the formal Bayes rule is to reject $H$ if $|t| > d(s_n)$, where it is also easy to see
that $d(s_n)$ is a decreasing function of $s_n$.
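The behavior of the posterior probability $p$ as a function of $t$ can be explored numerically. The sketch below is only illustrative: it approximates the $t$ CDF by Simpson's rule rather than calling a statistics library, and the sample size, the interval $[a,b]$, and the value of $s_n$ are assumed values chosen for the example.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Student t CDF via composite Simpson integration from 0 to x (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def posterior_prob(xbar, s, n, a, b):
    """p written in terms of the sample mean and standard deviation."""
    return t_cdf(sqrt(n) * (b - xbar) / s, n - 1) - t_cdf(sqrt(n) * (a - xbar) / s, n - 1)

def posterior_prob_from_t(t, s, n, a, b):
    """p written in terms of the statistic t of (4.133)."""
    half = sqrt(n) * (b - a) / (2 * s)
    return t_cdf(half - t, n - 1) - t_cdf(-half - t, n - 1)

n, a, b, s = 10, 0.0, 1.0, 1.2   # illustrative values only
for t in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(t, posterior_prob_from_t(t, s, n, a, b))
```

The printed values decrease as $|t|$ grows, so rejecting when $p < 1/(1+c)$ is the same as rejecting when $|t|$ exceeds a cutoff that depends on $s_n$.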

One possible classical solution is to abandon the UMPU criterion and just use
the usual t-test for testing that $M = (a+b)/2$. That is, reject $H$ if $|t| > d$, where
$t$ is defined in (4.133) and $d$ is determined to make the level $\alpha$. Unfortunately,
the conditional distribution of $\sqrt{n}(\bar X_n - [a+b]/2)/S_n$ given $(M,\Sigma) = (\mu,\sigma)$ is
noncentral $t$, $NCt_{n-1}(\sqrt{n}[\mu-(a+b)/2]/\sigma)$. The power function of the usual t-test
at $(\mu,\sigma)$ for $\mu \ne (a+b)/2$ goes to 1 as $\sigma$ goes to 0 for fixed $\mu$. Hence the usual
t-test has level 1 as a test of $H$ versus $A$. To get a test with level $\alpha < 1$, one
could let $d$ depend on $S_n$ as in the formal Bayes rule (although one should note
that the formal Bayes rule also has level 1; the problem occurs as $\sigma \to \infty$ for
the formal Bayes rule). Calculating the power function of the resulting test would
require a separate two-dimensional integration over the space of $(\bar x_n, s_n)$ values
for each $(\mu,\sigma)$ pair.

Another classical solution would be to add together the UMPU level $\alpha/2$ test
of $H' : M \le b$ versus $A' : M > b$ and the UMPU level $\alpha/2$ test of $H'' : M \ge a$
versus $A'' : M < a$. It can be shown (see Problem 54 on page 292) that this
test has size $\alpha$.$^{37}$ The power function is easy to calculate. It is just the sum of
the two power functions for the two one-sided tests. If $a < b$, this test is biased,
since there exists $\theta \in \Omega_A$ such that the power function at $\theta$ is close to $\alpha/2$. Note,
however, that if $a = b$, then this test is exactly the same as the UMPU level $\alpha$
test of $M = a$ versus $M \ne a$.

37 This test is the likelihood ratio test, to be defined in Section 4.5.5.

One could change the hypothesis in Example 4.132 to make it more like
Example 4.127.

Example 4.134 (Continuation of Example 4.127; see page 270). Suppose that
we wish to test $H : M \in [\mu_0 + a\Sigma^2, \mu_0 + b\Sigma^2]$ versus $A : M \notin [\mu_0 + a\Sigma^2, \mu_0 + b\Sigma^2]$.
This is a test about a linear combination of natural parameters, namely $\Psi_1 =
\Theta_1 + 2\mu_0\Theta_2$. The hypothesis is $H : \Psi_1 \in [a,b]$ versus $A : \Psi_1 \notin [a,b]$. We can
let $\Psi_2 = \Theta_2 = -1/[2\Sigma^2]$. We would now need to work with the conditional
distribution of $V = n(\bar X - \mu_0)$ given $U = n(\bar X - \mu_0)^2 + (n-1)S^2$. This conditional
distribution will have a density equal to a constant (function of $u$ and $\psi_1$) times
$\exp(v\psi_1)(u - v^2/n)^{(n-1)/2-1}$ for $|v| < \sqrt{nu}$. For each value of $u$, one would
have to find $d_1(u)$ and $d_2(u)$ so that

\[
\alpha = \Pr\left(n[\bar X - \mu_0] \notin [d_1(u),d_2(u)] \,\middle|\, U = u, \Psi_1 = \psi_1\right),
\]

for $\psi_1 = a$ and for $\psi_1 = b$. Of course, one could wait until the data were observed
and then do it only for the observed value of $U$.



4.5.5 Likelihood Ratio Tests 

A popular approach to forming tests, when no obvious UMP or UMPU test
is available, is to use the likelihood ratio criterion (LR). The idea is to start
with the Neyman-Pearson concept of likelihood ratios. In the Neyman-Pearson
setup, the likelihood under $H$ is just a single number, as it is under
$A$. In general, however, the likelihoods are functions of $\theta$. In the Bayesian
approach, we integrate those functions with respect to a distribution. In
LR tests, we maximize those functions over $\theta$. The LR criterion is

\[
LR = \frac{\sup_{\theta\in\Omega_H} f_{X|\Theta}(X|\theta)}{\sup_{\theta\in\Omega} f_{X|\Theta}(X|\theta)}.
\]

To test a hypothesis $H$ using the LR criterion, choose a number $c$ and
reject $H$ if $LR < c$. The number $c$ is usually chosen to make the level of the
test equal to some prespecified value. Sometimes the distribution of LR is
recognizable, and sometimes it is not. If the distribution is not recognizable,
then an approximate distribution is provided in Section 7.5.1.

If $\sup_{\theta\in\Omega} f_{X|\Theta}(x|\theta) = \sup_{\theta\in\Omega_A} f_{X|\Theta}(x|\theta)$, then LR is the same as the
approximate Bayes factor in (4.24) when $X = x$ is observed.$^{38}$ If, in addition,
$\Omega_H$ is a single point, then LR is the same as the global lower bound
on the Bayes factor (4.20).

38 For example, if every point in $\Omega_H$ is the limit of a sequence of points in $\Omega_A$
and $f_{X|\Theta}(x|\theta)$ is continuous in $\theta$, this condition will hold.






Example 4.135. Suppose that $X_1,\ldots,X_n$ are conditionally IID with $Ber(\theta)$
distribution given $\Theta = \theta$. Let the hypothesis be $H : \Theta = \theta_0$ versus $A : \Theta \ne \theta_0$.
Let $Y = \sum_{i=1}^n X_i$. Then

\[
LR = \frac{\theta_0^Y(1-\theta_0)^{n-Y}}{(Y/n)^Y(1-Y/n)^{n-Y}},
\]

since $\theta^Y(1-\theta)^{n-Y}$ is largest if $\theta = Y/n$. The LR test would be to reject $H$ if
LR is smaller than some specified value. As a function of $Y$, LR increases until
$Y/n$ reaches $\theta_0$ and then decreases. For example, if $\theta_0 = 1/4$ and $n = 10$, we
have a case similar to Example 4.112 on page 262. The UMPU level $\alpha = 0.05$
test of $H : \Theta = 1/4$ was found there to be a test that rejected $H$ if $Y \ge 6$ and
randomized if $Y \in \{0,5\}$. If $Y = 0$ is observed, then $LR = 0.0563$. If $Y = 6$ is
observed, then $LR = 0.0647$. It follows that the level $\alpha = 0.05$ LR test is not
the same as the UMPU level 0.05 test. The reason is that no LR test can reject
for $Y = 6$ without rejecting also for $Y = 0$, since LR is smaller at $Y = 0$ than
at $Y = 6$. The level 0.05 LR test will reject $H$ for $Y \ge 7$ and will randomize if
$Y = 0$ with probability 0.8256 of rejecting $H$. Note that the LR test is of the
form of Theorem 4.104, so that it is admissible, but it is not UMPU.
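The numbers in Example 4.135 can be reproduced by ordering the possible values of $Y$ by their LR values and filling the rejection region until the level is reached. The following sketch does this for $\theta_0 = 1/4$ and $n = 10$. The function names are illustrative, and the routine assumes that no two values of $Y$ tie at the boundary LR value.

```python
from math import comb

def lr(y, n, theta0):
    """theta0^y (1-theta0)^(n-y) divided by its maximum over theta (attained at y/n)."""
    mle = y / n
    return (theta0**y * (1 - theta0)**(n - y)) / (mle**y * (1 - mle)**(n - y))

def binom_pmf(y, n, theta0):
    return comb(n, y) * theta0**y * (1 - theta0)**(n - y)

def lr_test(n, theta0, alpha):
    """Reject for the smallest LR values first; randomize at the boundary value."""
    order = sorted(range(n + 1), key=lambda y: lr(y, n, theta0))
    size, reject = 0.0, []
    for y in order:
        p = binom_pmf(y, n, theta0)
        if size + p > alpha:
            return sorted(reject), y, (alpha - size) / p
        size += p
        reject.append(y)
    return sorted(reject), None, 0.0

print(lr(0, 10, 0.25), lr(6, 10, 0.25))
print(lr_test(10, 0.25, 0.05))
```

The second line of output lists the values of $Y$ that are rejected outright, the value at which the test randomizes, and the randomization probability.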

Example 4.136. Suppose that $X_1,\ldots,X_n$ are conditionally IID given $\Theta =
(M,\Sigma) = (\mu,\sigma)$ with $N(\mu,\sigma^2)$ distribution. Suppose that $H : M = \mu_0$ is the
hypothesis. The formula in (4.26) gives the observed value of LR, namely
$(1 + t^2/[n-1])^{-n/2}$, where $t$ is the test statistic for the usual t-test. Since LR is a
decreasing function of $|t|$, the level $\alpha$ LR test will be the same as the UMPU level
$\alpha$ test for all $\alpha$.
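The closed form for LR in Example 4.136 can be confirmed by maximizing the normal likelihood directly. In the sketch below the data values and $\mu_0$ are made up for illustration, and the function names are not from the text.

```python
from math import exp, log, pi, sqrt

def normal_loglik(xs, mu, sigma2):
    n = len(xs)
    return -0.5 * n * log(2 * pi * sigma2) - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

def lr_direct(xs, mu0):
    """Ratio of the likelihood maximized under M = mu0 to the likelihood maximized freely."""
    n = len(xs)
    xbar = sum(xs) / n
    # For a fixed mean, the maximizing variance is the average squared deviation.
    s2_h = sum((x - mu0) ** 2 for x in xs) / n
    s2_all = sum((x - xbar) ** 2 for x in xs) / n
    return exp(normal_loglik(xs, mu0, s2_h) - normal_loglik(xs, xbar, s2_all))

def lr_from_t(xs, mu0):
    n = len(xs)
    xbar = sum(xs) / n
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    t = sqrt(n) * (xbar - mu0) / s
    return (1 + t * t / (n - 1)) ** (-n / 2)

xs = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]   # hypothetical data
print(lr_direct(xs, 4.0), lr_from_t(xs, 4.0))
```

Both computations give the same number, and the second form makes it plain that LR depends on the data only through $|t|$.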

As mentioned earlier, the main reason for introducing LR tests is that
they can be used in situations in which UMPU tests do not exist. In the
following example, $\sup_{\theta\in\Omega} f_{X|\Theta}(x|\theta)$ is not equal to $\sup_{\theta\in\Omega_A} f_{X|\Theta}(x|\theta)$ for
all possible data values $x$, so that there will exist data sets for which LR
is not equal to the approximate Bayes factor (4.24).

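Example 4.137. Suppose that $X_1,\ldots,X_n$ are conditionally IID $N(\mu,\sigma^2)$ given
$(M,\Sigma) = (\mu,\sigma)$, and $H : a \le M \le b$ versus $A$ : not $H$. We can easily calculate
the LR criterion:

\[
LR = \begin{cases}
1 & \text{if } \bar X \in [a,b],\\[4pt]
\left(\dfrac{\sum_{i=1}^n (X_i-\bar X)^2}{\sum_{i=1}^n (X_i-a)^2}\right)^{n/2} & \text{if } \bar X < a,\\[8pt]
\left(\dfrac{\sum_{i=1}^n (X_i-\bar X)^2}{\sum_{i=1}^n (X_i-b)^2}\right)^{n/2} & \text{if } \bar X > b.
\end{cases}
\]

We can easily see that the level $\alpha$ LR test will be the sum of two one-sided tests.
Since Problem 54 on page 292 shows that the level of the sum of two one-sided
tests in this problem is the sum of the levels, and since LR decreases equally fast
as $\bar X$ drops below $a$ or rises above $b$, it follows that the two tests should each have
level $\alpha/2$. The LR test becomes the test described at the end of Example 4.132.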

In Section 7.5.1, we will prove some large sample properties of LR tests. 






4.5.6 The Standard F-Test as a Bayes Rule* 

In many of the examples in this chapter, we compared various classical tests 
to Bayesian procedures. In particular, we found that by using improper 
priors, we could make many classical tests into formal Bayes rules. For the 
case of normal distributions, it is actually possible to find a proper prior 
distribution that leads to a similar result. Consider one of the general linear 
models, such as analysis of variance or regression. In this section, we will 
find a proper prior distribution for the parameters such that the standard 
F-test emerges as a Bayes rule and is seen to be admissible with 0-1-c loss. 

General linear models can be transformed into the following. The parameters
are $\Theta = (M,\Psi,\Sigma)$, with $M$ an $r$-dimensional vector and $\Psi$ an
$s$-dimensional vector. Let $k = s + r$. The sufficient statistics are $(Y,W,U)$,
which are conditionally independent given the parameters with distributions

\[
Y \sim N_r(\mu,\sigma^2 I), \quad W \sim \sigma^2\chi^2_d, \quad U \sim N_s(\psi,\sigma^2 I). \tag{4.138}
\]

Example 4.139. Suppose that $X_{1,i}$, $i = 1,\ldots,n_1$ are IID $N(\mu_1,\sigma^2)$ independent
of $X_{2,i}$, $i = 1,\ldots,n_2$ with $N(\mu_2,\sigma^2)$ distribution given $M_1 = \mu_1$, $M_2 = \mu_2$, $\Sigma = \sigma$.
A popular hypothesis is $H : M_1 = M_2$. One can write this as $H : M_1 - M_2 = 0$. In
the above notation, $r = 1$, $Y$ is the difference between the two sample averages
times $\sqrt{n_1n_2/(n_1+n_2)}$, and $M$ is $M_1 - M_2$ times the same factor. Also, $s = 1$, $U$ is the sum of all the
observations divided by $\sqrt{n_1+n_2}$, and $\Psi = (n_1M_1 + n_2M_2)/\sqrt{n_1+n_2}$. Finally,
$d = n_1 + n_2 - 2$ and $W$ is the pooled sum of squared deviations.

Example 4.140. Suppose that $Y_1,\ldots,Y_n$ are conditionally independent given
$B = \beta$, $\Sigma = \sigma$ with $Y_i \sim N(x_i^\top\beta,\sigma^2)$ distribution, where the $x_i$ are known
$k$-dimensional vectors. This is the standard linear regression model. A typical
hypothesis is of the form $H : CB = c$ versus $A : CB \ne c$, where $C$ is an $r \times k$
matrix of rank $r \le k$ and $c$ is an $r$-dimensional vector. Define the matrix $G =
\sum_{i=1}^n x_ix_i^\top$, and assume that $G$ is nonsingular. The usual least-squares estimator
of $B$ is $\hat B = G^{-1}\sum_{i=1}^n x_iY_i$. Its conditional distribution given the parameters
is $N_k(\beta,\sigma^2G^{-1})$. The conditional distribution of $C\hat B$ given the parameters is
$N_r(C\beta,\sigma^2CG^{-1}C^\top)$. Let $D$ be a $(k-r) \times k$ matrix whose $k-r$ rows are all
orthogonal to the $r$ rows of $CG^{-1}$. The sufficient statistics can then be written
as

\[
Y = (CG^{-1}C^\top)^{-1/2}C\hat B, \quad W = \sum_{i=1}^n (Y_i - x_i^\top\hat B)^2, \quad U = (DG^{-1}D^\top)^{-1/2}D\hat B.
\]

With $\Psi = (DG^{-1}D^\top)^{-1/2}DB$, $M = (CG^{-1}C^\top)^{-1/2}CB$, $s = k-r$, and $d = n-k$,
we see that $(Y,W,U)$ have the distributions given in (4.138).

For the general situation, construct new random variables $B > 0$, $\Gamma \in \mathbb{R}^r$,
and $H \in \{0,1\}$, which are conditionally independent of $(Y,W,U)$ given $\Theta$.



*This section may be skipped without interrupting the flow of ideas. 
39 If k = r, we don't need the U vector. 






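The distribution of $\Theta$ and the new parameters is as follows. Given $\Psi = \psi$,
$\Sigma = \sigma$, $B = \beta$, $\Gamma = \gamma$, and $H = h$; $M = h\gamma\beta/(1+\gamma^\top\gamma)$ with probability 1.
Given $\Sigma = \sigma$, $B = \beta$, $\Gamma = \gamma$, and $H = h$,

\[
\Psi \sim N_s\left(0, \frac{\gamma^\top\gamma}{1+\gamma^\top\gamma}\,I\right).
\]

Given $B = \beta$, $\Gamma = \gamma$, and $H = h$, $\Sigma = 1/\sqrt{1+\gamma^\top\gamma}$ with probability 1.

This makes the conditional distribution of $\Psi$ into $N_s(0,(1-\sigma^2)I)$, which
we shall call $P_\sigma$. Given $B = \beta$ and $H = h$, the density of $\Gamma$ is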

fr\BM (l\0,h) = ithAl) 

j c 0 (l+7 T 7)-^ iffc = 0, 

I Cl (/3)(l+ 7 T 7)-^exp(j^) if fc=l. 

Finally, $B$ and $H$ are independent, with $B$ having some density $f(\beta)$ strictly
positive on all of $(0,\infty)$, and $\Pr(H = 0) = p_0$.

The Bayes rule with respect to 0-1-c loss is to reject $H = 0$ if $\Pr(H =
1|Y = y, U = u, W = w)$ is large.



Pr(# = 1\Y = y,U = %W = w) 



(4.141) 



Pr{H = 0\Y = y,U = u,W = w) 

f I f f fY,u,w\e(y, % ^> a)dP a {il;)dQi n ^{a, ti)7r h(3 (~f)d'yf(p)df3 
fin fv,u,w\e{y, % H/*> ^)dP £T (^)dQo, 7 (^ ^oAl^f^W ' 
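where $Q_{1,\gamma,\beta}$ is the distribution for $(M,\Sigma)$ that puts all of its mass on
$\mu = \gamma\beta/(1+\gamma^\top\gamma)$ and $\sigma = 1/\sqrt{1+\gamma^\top\gamma}$; $Q_{0,\gamma}$ is the distribution of
$(M,\Sigma)$ that puts all of its mass on $\mu = 0$ and $\sigma = 1/\sqrt{1+\gamma^\top\gamma}$; and
$f_{Y,U,W|\Theta}(y,u,w|\mu,\psi,\sigma)$ is proportional to

\[
\sigma^{-s-r-d}\,w^{d/2-1}\exp\left\{-\frac{1}{2\sigma^2}\left[w + \sum_{i=1}^r (y_i-\mu_i)^2 + \sum_{j=1}^s (u_j-\psi_j)^2\right]\right\}.
\]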

To find the Bayes rule, we begin with the innermost integration (over ip) 
in both numerator and denominator (since they are the same). The integral 
to be performed is 



2 exp^-^ dtp, 



where 



(J 2 



s 8 ( 

E^ 2 = E{« 

j=l j=l 1 



+ (7 2 (1-<T 2 ) J 




It follows that $\sigma(1-\sigma^2)^{1/2}$ is a scale factor for each coordinate of $\psi$. Hence
the integral is proportional to $\exp(-u^\top u/2)$. Since this depends on the
data alone, it cancels out of the numerator and the denominator along
with $w^{d/2-1}$ and the constant in the data density. What remains of (4.141)
is



/ / / a-*-* exp (£[w + £[ =1 (2/i - Mi ) 2 ]) dQ 0 „(a, n)**di)*lf&W ' 



The next innermost integrals are with respect to point mass probabilities, 
so they merely evaluate the integrands at the points where they put their 
mass. The result is 




/ /(l + 7 T 7)^ exp {-±±£2(«> + „T„)} Wo ^) dl f((3)d(3 



Next, we integrate over 7. In the denominator, we get 




Call this last expression K. In the numerator, we get 




c?7 



So, the ratio is 



constant x / Ci (/?)/(/?) exp 



2 y T i/+ii> J ^ 




2/ T 2/ _ rF 



y T y + w rF + d' 



4.6. P-Values 279 



where F is the classical F statistic. Hence the Bayes rule is to reject H 
when F > c for some number c, which is the classical F-test. Because of 
the way the prior distribution was constructed, we can show that the Bayes 
rule is admissible (see Problem 60 on page 294), so the F-test is admissible. 
Notice that the prior distribution depends on the sample size, so it could 
not be used as a real prior distribution unless we knew for sure what the 
sample size would be before observing the data. 

4.6 P-Values 

4.6.1 Definitions and Examples 

A common criticism of hypothesis-testing methodology is that the decision 
to "reject" or "accept" a hypothesis is not informative enough. One should 
also provide a measure of the strength of evidence in favor of (or against) 
the hypothesis. The posterior probability of the hypothesis is an obvious 
candidate to provide the strength of evidence in favor of the hypothesis, 
but the posterior is not available in a classical analysis. In fact, there is 
no theory for strength of evidence or degree of support in the classical 
theory. Instead, some alternatives to testing hypotheses are available. The 
alternative considered here is to give the set of all levels for which a specific 
hypothesis would be rejected. 40 For most of the tests that we will consider 
in this book, the set of a values such that the level a test would reject H 
will be an interval starting at some lower endpoint and extending to 1. The 
lower endpoint will be called the P -value of the observed data relative to 
the collection of tests. 

Definition 4.142. Let $H$ be a hypothesis, and let $\Gamma$ be a set indexing
nonrandomized tests of $H$ (i.e., $\{\phi_\gamma : \gamma \in \Gamma\}$ is a set of nonrandomized
tests of $H$). For each $\gamma \in \Gamma$, let $\varphi(\gamma)$ be the size of the test $\phi_\gamma$. Define

\[
p_H(x) = \inf\{\varphi(\gamma) : \phi_\gamma(x) = 1\}.
\]

We call $p_H(x)$ the P-value of the observed data $x$ relative to the set of tests
for the hypothesis $H$.

Usually, when the data have continuous distribution, we can arrange for
$\Gamma = [0,1]$ and $\varphi(\gamma) = \gamma$. That is, there is one and only one size $\gamma$ test in the
set for each $\gamma \in [0,1]$. If the data have a discrete distribution, it may be
impossible to achieve certain sizes with nonrandomized tests. Often, it is
understood implicitly which set of tests is under consideration and what is
the hypothesis. In these cases, $p_H(x)$ is called the P-value without reference
to the set of tests or the hypothesis.

40 Another alternative is to provide interval (or set) estimates for parameters
(see Section 5.2). For example, a coefficient $1-\alpha$ confidence set (Definition 5.47)
is defined in such a way that it contains all of the values of $\theta$ such that the
hypothesis $H : \Theta = \theta$ would be accepted at level $\alpha$ (see Proposition 5.48).

Example 4.143. Suppose that $X \sim N(\theta,1)$ given $\Theta = \theta$ and $H_1 : \Theta \in
[-0.5,0.5]$. The UMPU level $\alpha$ test of $H_1$ is $\phi_\alpha(x) = 1$ if $|x| > c_\alpha$, for some number
$c_\alpha$. If $X = 2.18$ is observed, $\phi_\alpha$ will reject $H_1$ if and only if $2.18 > c_\alpha$. Since $c_\alpha$
increases as $\alpha$ decreases, the P-value is that $\alpha$ such that $c_\alpha = 2.18$. If $c_\alpha = 2.18$,
then the test is $\phi_\alpha(x) = 1$ if $|x| > 2.18$, so $\alpha = \Phi(-2.68) + 1 - \Phi(1.68) = 0.0502$
is the P-value.
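The arithmetic in Example 4.143 is easy to reproduce. The sketch below evaluates the size of the test at an endpoint of the hypothesis, where it is attained; the function names are illustrative and not from the text.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def interval_p_value(x, lo, hi):
    """P-value of x for H: theta in [lo, hi] with X ~ N(theta, 1), using tests |x - mid| > c."""
    mid = (lo + hi) / 2
    c = abs(x - mid)
    # The size of the test is its rejection probability at an endpoint; by symmetry use hi.
    return Phi(mid - c - hi) + 1 - Phi(mid + c - hi)

print(round(interval_p_value(2.18, -0.5, 0.5), 4))
```

Here the cutoff is $c = 2.18$, so the two terms are $\Phi(-2.68)$ and $1 - \Phi(1.68)$, exactly as in the example.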

The reader will note that we used the same notation $p_H(x)$ to denote the
P-value as we used for the significance probability in Definition 4.8. The
reason is that they are almost always the same thing.

Proposition 4.144. Let $\{\phi_\gamma : \gamma \in \Gamma\}$ be a collection of tests. Suppose that
$\Gamma \subseteq [0,1]$ and $\gamma_1 > \gamma_2$ implies that for all $x$, $\phi_{\gamma_1}(x) \ge \phi_{\gamma_2}(x)$. Define the
binary relation $\preceq$ on the sample space by $x \preceq y$ if and only if the P-value
for $x$ is at least as large as the P-value for $y$. Then $\preceq$ is a weak order and
the P-value always equals the significance probability.

The conditions of Proposition 4.144 say that for every possible observa- 
tion x, if x leads to rejection of H at one level, then it leads to rejection at 
every higher level. This is just a precise way of saying what we said earlier 
about the set of levels at which a hypothesis would be rejected being an in- 
terval running from some lower bound up to 1. Although the conditions of 
Proposition 4.144 are met in most situations, the following example [from 
Lehmann (1986) 41 ] is a case in which they are not. 

Example 4.145. Suppose that $\Omega = \{1,2,3\}$ and $\mathcal{X} = \{1,2,3,4\}$. Consider the
following conditional distribution for $X$ given $\Theta$:

              x = 1    x = 2    x = 3    x = 4
   θ = 1       2/13     4/13     3/13     4/13
   θ = 2       4/13     2/13     1/13     6/13
   θ = 3       4/13     3/13     2/13     4/13

Consider the hypothesis $H : \Theta \le 2$ versus $A : \Theta = 3$. One can show that the MP
level 5/13 test of $H$ is $\phi_{5/13}(x) = 1$ if $x \in \{1,3\}$ and that the MP level 6/13 test
is $\phi_{6/13}(x) = 1$ if $x \in \{1,2\}$. For $\alpha = 1$, $\phi_\alpha(x) = 1$ for $x \in \{1,2,3,4\}$. So $X = 3$
leads us to reject $H$ at some high values of $\alpha$ and at some low values of $\alpha$, but not
at certain values in between. The infimum of the set of all $\alpha$ such that $\phi_\alpha(3) = 1$
no longer tells us all of the levels at which we would reject $H$. In particular, one
of the conditions of Proposition 4.144 is violated.
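The two rejection regions in Example 4.145 can be found by brute force. The sketch below searches only over nonrandomized tests, which is a simplifying assumption of this illustration (the example asserts the stronger statement that these tests are MP), and it uses exact fractions to avoid rounding. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Pr(X = x | Theta = theta) in thirteenths, from the table in Example 4.145.
PMF = {1: {1: 2, 2: 4, 3: 3, 4: 4},
       2: {1: 4, 2: 2, 3: 1, 4: 6},
       3: {1: 4, 2: 3, 3: 2, 4: 4}}

def prob(region, theta):
    return Fraction(sum(PMF[theta][x] for x in region), 13)

def best_nonrandomized(level):
    """Rejection region with the largest power at theta = 3 among those of size <= level."""
    best, best_power = (), Fraction(0)
    for r in range(5):
        for region in combinations([1, 2, 3, 4], r):
            size = max(prob(region, 1), prob(region, 2))   # size over H: theta <= 2
            power = prob(region, 3)
            if size <= level and power > best_power:
                best, best_power = region, power
    return set(best), best_power

print(best_nonrandomized(Fraction(5, 13)))
print(best_nonrandomized(Fraction(6, 13)))
```

The point $x = 3$ belongs to the first region but not to the second, which is the failure of nestedness that the example describes.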

Because P-values are between 0 and 1 and because the smaller the P- 
value is the smaller a would have to be before one could accept the hypoth- 
esis, people like to think of the P-value as if it were the probability that the 



41 The example appears in Problem 34 on page 121 of that text and in Prob- 
lem 29 on page 116 of the 1959 edition. 






hypothesis is true. Those who are more careful with their terminology will 
still suggest that it is the degree to which the data support the hypothesis. 
Sometimes this is approximately true, as in the next two examples. 42 

Example 4.146. Suppose that $X \sim Bin(n,p)$ given $P = p$, and let $H : P \le p_0$.
The UMP level $\alpha$ test rejects $H$ when $X > c_\alpha$, where $c_\alpha$ increases as $\alpha$ decreases.
The P-value of an observed value $x$ is the value of $\alpha$ such that $c_\alpha = x - 1$ unless
$x = 0$, in which case the P-value is 1. The P-value can then be calculated as

\[
p_H(x) = \sum_{i=x}^{n} \binom{n}{i} p_0^i (1-p_0)^{n-i}.
\]

This formula is also correct when $x = 0$.

Next, suppose that we used an improper prior for $P$ of the form $Beta(0,1)$.
The posterior distribution of $P$ would be $Beta(x, n+1-x)$. If $x > 0$, the posterior
probability that $H$ is true is $\Pr(Y \le p_0)$, where $Y \sim Beta(x, n+1-x)$. This is
the probability that at least $x$ out of $n$ IID $U(0,1)$ random variables are less than
or equal to $p_0$, because $Y$ has the distribution of the $x$th order statistic from a
sample of $n$ IID $U(0,1)$ random variables. The probability that a single $U(0,1)$
is less than or equal to $p_0$ is $p_0$, and the $n$ of them are IID so the probability
that at least $x$ of them are less than or equal to $p_0$ is

\[
\sum_{i=x}^{n} \binom{n}{i} p_0^i (1-p_0)^{n-i}.
\]

So, the P-value is equal to the posterior probability that the hypothesis is true
(using an improper prior), at least when the posterior is proper. If $x = 0$, then
the posterior is still improper $Beta(0,n+1)$.$^{43}$

Consider next what happens if $H : P \ge p_0$. It turns out that the improper
prior must change to $Beta(1,0)$. (See Problem 64 on page 294.) Because two
different priors are needed to obtain the "degree of support" for the two different
hypotheses, we get the following anomaly. If we take the two hypotheses together,
$\{P \le p_0\} \cup \{P \ge p_0\}$, the total degree of support is

\[
1 + \binom{n}{x} p_0^x (1-p_0)^{n-x}.
\]

One can easily check that this is not due to the fact that $\{P = p_0\}$ is included in
both hypotheses. One could leave it out of either one and the results would be
the same.
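The equality between the binomial tail probability and the $Beta(x, n+1-x)$ posterior probability can be checked numerically, as can the fact that the two one-sided P-values add to more than 1. In the sketch below the Beta CDF is approximated by Simpson's rule, and $n$, $x$, and $p_0$ are illustrative values; none of the function names come from the text.

```python
from math import comb, factorial

def binom_pmf(i, n, p0):
    return comb(n, i) * p0**i * (1 - p0)**(n - i)

def upper_tail(x, n, p0):
    """P-value for H: P <= p0, namely Pr(X >= x | P = p0)."""
    return sum(binom_pmf(i, n, p0) for i in range(x, n + 1))

def lower_tail(x, n, p0):
    """P-value for H: P >= p0, namely Pr(X <= x | P = p0)."""
    return sum(binom_pmf(i, n, p0) for i in range(0, x + 1))

def beta_cdf(p0, x, n, steps=2000):
    """Pr(Beta(x, n + 1 - x) <= p0) for integer 1 <= x <= n, by composite Simpson integration."""
    const = factorial(n) / (factorial(x - 1) * factorial(n - x))
    def density(t):
        return const * t**(x - 1) * (1 - t)**(n - x)
    h = p0 / steps
    total = density(0.0) + density(p0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3

n, x, p0 = 10, 3, 0.25   # illustrative values only
print(upper_tail(x, n, p0), beta_cdf(p0, x, n))
print(upper_tail(x, n, p0) + lower_tail(x, n, p0), 1 + binom_pmf(x, n, p0))
```

The first line shows the P-value agreeing with the posterior probability, and the second shows the total "degree of support" exceeding 1 by $\Pr(X = x|P = p_0)$.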

A similar situation occurs with Poisson data. 

Example 4.147 (Continuation of Example 4.61; see page 241). The P-value of
an observed data value $x$ is $\Pr(X \ge x|\Theta = 1)$. This can also be written as

$p_H(x)$ = Pr(at least $x$ events in one time unit of a rate 1 Poisson process)
= Pr(time until $x$th event is $\le 1$) = $\Pr(Y \le 1)$ = $\Pr(\Theta \le 1|X = x)$,



42 Berkson (1942) carefully examines the use of significance probabilities as 
evidence in favor of hypotheses. 

43 There is a sense in which Pr(P < po\X = 0) = 1 even in this case, but it 
requires the notion of finitely additive probability. 







where $Y \sim \Gamma(x,1)$ and assuming that the "prior" for $\Theta$ is the improper $d\theta/\theta$. So,
the P-value is the posterior probability that $H$ is true if the prior is improper.
This is actually true in general for Poisson distributions and hypotheses of the
form $H : \Theta \le \theta_0$. The implications of this result include the following. If a
Bayesian uses the improper prior $d\theta/\theta$ and has a 0-1-c loss function, then he or
she will reject $H$ if the P-value is less than $1/(1+c)$. This is the UMP level $\alpha$
test if $\alpha = 1/(1+c)$.

Next, suppose that $H : \Theta \ge 1$ and $A : \Theta < 1$. The UMP level $\alpha$ test is

\[
\phi(x) = \begin{cases} 1 & \text{if } x < c,\\ \gamma & \text{if } x = c,\\ 0 & \text{if } x > c, \end{cases}
\]

where $c$ and $\gamma$ are chosen so that $\phi$ has size $\alpha$. The P-value of an observed data
value $x$ is $\Pr(X \le x|\Theta = 1)$. As we did earlier, we can write this as

$p_H(x)$ = Pr(at most $x$ events in one time unit of a rate 1 Poisson process)
= Pr(time until event $x+1$ is $> 1$) = $\Pr(Y > 1)$ = $\Pr(\Theta \ge 1|X = x)$,

where $Y \sim \Gamma(x+1,1)$ and assuming that the "prior" for $\Theta$ is the improper $d\theta$.
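Both Poisson identities in Example 4.147 can be verified numerically by comparing Poisson tail sums with gamma probabilities. The sketch below approximates the gamma CDF at 1 by Simpson's rule; the function names and the values of $x$ are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(k, rate=1.0):
    return exp(-rate) * rate**k / factorial(k)

def gamma_cdf_at_one(shape, steps=2000):
    """Pr(Y <= 1) for Y ~ Gamma(shape, 1) with integer shape >= 1, by composite Simpson integration."""
    def density(t):
        return t**(shape - 1) * exp(-t) / factorial(shape - 1)
    h = 1.0 / steps
    total = density(0.0) + density(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3

def upper_tail(x):
    """Pr(X >= x | Theta = 1), the P-value for H: Theta <= 1."""
    return 1.0 - sum(poisson_pmf(k) for k in range(x))

def lower_tail(x):
    """Pr(X <= x | Theta = 1), the P-value for H: Theta >= 1."""
    return sum(poisson_pmf(k) for k in range(x + 1))

for x in (1, 3, 6):
    print(x, upper_tail(x), gamma_cdf_at_one(x), lower_tail(x), 1 - gamma_cdf_at_one(x + 1))
```

In each printed row the first pair of numbers agrees and so does the second pair. The two hypotheses require gamma distributions with different shape parameters, $x$ and $x+1$, which corresponds to the two different improper priors.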

If we modify Example 4.143 slightly, we discover a case in which it is 
simply impossible to use P- values for measuring degree of support. [See 
also Schervish (1996).] 

Example 4.148 (Continuation of Example 4.143; see page 280). Let $H_2 : \Theta \in
[-0.82,0.52]$. The UMPU level $\alpha$ test is $\psi_\alpha(x) = 1$ if $|x + 0.15| > d_\alpha$. If $X = 2.18$
is observed, then $d_\alpha = 2.33$ and

\[
\alpha = \Phi(-3) + 1 - \Phi(1.66) = 0.0498.
\]

This is smaller than the "degree of support" for the smaller hypothesis $H_1$. It
does not make any sense to have a concept of degree of support that gives more
support to a smaller hypothesis than it gives to a larger one. In the one-sided
testing case, this does not happen. (See Problem 62 on page 294.)
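The comparison between the two interval hypotheses can be reproduced with a few lines of code. The sketch below computes the P-value of $x = 2.18$ for $H_1 = [-0.5, 0.5]$ and for the larger hypothesis $H_2 = [-0.82, 0.52]$; the function names are illustrative and not from the text.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def interval_p_value(x, lo, hi):
    """P-value of x for H: theta in [lo, hi] with X ~ N(theta, 1), using tests |x - mid| > d."""
    mid = (lo + hi) / 2
    d = abs(x - mid)
    # The size is the rejection probability at an endpoint of the interval; use hi.
    return Phi(mid - d - hi) + 1 - Phi(mid + d - hi)

p1 = interval_p_value(2.18, -0.5, 0.5)     # hypothesis H_1
p2 = interval_p_value(2.18, -0.82, 0.52)   # hypothesis H_2, which contains H_1
print(round(p1, 4), round(p2, 4))
```

The second number is smaller than the first even though $H_2$ contains $H_1$, which is the anomaly described in the example.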

In Example 4.148, we saw that the P-value of a data value relative to the 
class of UMPU tests behaved strangely as the hypothesis varied. This exam- 
ple is closely related to the incoherent tests discovered in Example 4.88 on 
page 252 and Example 4.102 on page 257. Problems 61 and 62 on page 294 
show that for one-sided hypotheses with known or unknown variance, the 
P-value always equals the posterior probability of the hypothesis calcu- 
lated from the usual improper prior. In the case of interval hypotheses 
with unknown variance, the situation is somewhat different. 

Example 4.149. Suppose that $X_1,\ldots,X_n$ are conditionally IID with $N(\mu,\sigma^2)$
distribution given $\Theta = (\mu,\sigma)$. For a hypothesis of the form $H : a \le M \le b$ with
$a < b$ we have not found UMPU tests. We do, however, have the collection of
likelihood ratio (LR) tests. (See Examples 4.132 and 4.137.) The UMPU tests of
one-sided hypotheses (like $M \le b$ or $M \ge a$) and point hypotheses (like $M = c$
versus $M \ne c$) are also LR tests. So, we might try to compare the P-values
for various hypotheses relative to the families of LR tests. If $\bar X = \bar x > b$ and
$S^2 = \sum_{i=1}^n (X_i - \bar X)^2/(n-1) = s^2$ are observed, the P-value for $H : a \le M \le b$
will be the level of the LR test that rejects $H$ when $\sqrt{n}(\bar X - b)/S > \sqrt{n}(\bar x - b)/s$
or when $\sqrt{n}(\bar X - a)/S < -\sqrt{n}(\bar x - b)/s$. Call this P-value $p_H(x)$. It is easy to see
that $p_H(x)$ is precisely the same as the P-value for the hypothesis $H_b : M = b$
versus $A_b : M \ne b$ relative to the collection of UMPU (two-sided) tests. Also
$p_H(x)$ is precisely twice the size of the P-value for the hypothesis $H'_b : M \le b$
versus $A'_b : M > b$ relative to the collection of UMPU (one-sided) tests.

Since the one-sided P-values equal posterior probabilities of the hypotheses
when using improper priors (see Problem 62 on page 294), we find that the one-sided
P-value for the hypothesis $H'_c : M \le c$ is a continuous function of $c$. It
follows that there exists $c > b$ such that the one-sided P-value for $H'_c$ satisfies
$p_H(x) > p_{H'_c}(x) > p_H(x)/2$. If we are to interpret the P-values relative to the
collections of LR tests as degrees of support for the respective hypotheses, then
if $\bar x > b$, the degree of support for every hypothesis of the form $H_a : M \in [a,b]$
(for varying $a$ but fixed $b$) is the same number $p_H(x)$ (even if $a = b$ or if $a$ is
much less than $b$). But the degree of support for the hypothesis $H'_c : M \in (-\infty,c]$
(where $c > b$ is chosen as above) is $p_{H'_c}(x) < p_H(x)$. In words, the data offer more
support for every hypothesis of the form $M \in [a,b]$ than they do for $M \in (-\infty,c]$
even though $[a,b] \subset (-\infty,c]$ for every $a$.

We see that there are cases (usually one-sided testing) in which P-values 
can correspond to a degree of support for the hypothesis, but there are 
other cases (e.g., two-sided alternatives) when they cannot. It is possible, 
for example, with normal data, to express certain P-values as weighted 
averages over the corresponding hypotheses of P-values for testing point 
hypotheses. (See Problem 66 on page 294.) To generalize this idea beyond 
normal distributions (or symmetric location families), one needs to con- 
sider tests that may not be UMPU. Spj0tvoll (1983) defines a measure of 
"acceptability" of point hypotheses, which has the property that for many 
distributions and certain hypotheses, the weighted average of the accept- 
ability over the hypothesis equals something closely related to the P-value. 

In Section 6.3.1, we give some general conditions under which P-values 
are equal to the posterior probabilities of hypotheses. Casella and Berger 
(1987) study the problem of testing a one-sided hypothesis-alternative pair 
and find that in many cases, the P-value is approximately a limit of pos- 
terior probabilities. Examples 4.143 and 4.148 point out that the P-value 
cannot be taken as a method for providing a "degree of support" for general 
hypotheses, however. 

4.6.2 P- Values and Bayes Factors 

In Section 4.2.2, we introduced Bayes factors as ways to quantify the degree
of support for a hypothesis in a data set. In particular, there are lower
bounds on Bayes factors which indicate the smallest amount of support
one could coherently say that the data supply to the hypothesis. When
the lower bound is not particularly small, one would be hard pressed to
argue that the data are highly inconsistent with the hypothesis. Since P-values
have also been suggested as measures of support for the hypothesis
the data offer, it seems natural to compare the two. In one-sided cases,
we found that posterior probabilities (when using improper priors) often
corresponded to P-values. In this section we will only compare Bayes factors
to P-values for testing hypotheses of the form $H : \Theta = \theta_0$ versus $A : \Theta \ne \theta_0$.
Edwards, Lindman, and Savage (1963) and later Berger and Sellke (1987)
made comparisons of P-values with lower bounds on Bayes factors, and the
following two examples are inspired by the presentations in those sources.

Example 4.150. Suppose that $X_1, \ldots, X_n$ are conditionally IID with
$N(\theta, 1)$ distribution given $\Theta = \theta$, and we are interested in
testing $H : \Theta = \theta_0$. Let $p_0 = \Pr(\Theta = \theta_0) > 0$. If we
let the prior distribution $\lambda$ of $\Theta$ given that
$\Theta \neq \theta_0$ be unrestricted, then the global lower bound on the
Bayes factor is easily calculated to be $\exp(-n[\bar{x} - \theta_0]^2/2)$.
The lower bound on the Bayes factor for $\lambda$ being a normal distribution
centered at $\theta_0$ is 1 if $|\bar{x} - \theta_0| \le 1/\sqrt{n}$, and it
equals

$$\sqrt{n}\,|\bar{x} - \theta_0| \exp\left(\frac{-n(\bar{x} - \theta_0)^2 + 1}{2}\right)$$

if $|\bar{x} - \theta_0| > 1/\sqrt{n}$. The UMPU level $\alpha$ test of $H$ is
to reject $H$ if $\sqrt{n}|\bar{x} - \theta_0| > \Phi^{-1}(1 - \alpha/2)$, where
$\Phi$ is the standard normal CDF, so the P-value for an observation $x$ is
$2[1 - \Phi(\sqrt{n}|\bar{x} - \theta_0|)]$. All of these, the P-value and the
two lower bounds, are monotone decreasing functions of
$\sqrt{n}|\bar{x} - \theta_0|$. Table 4.152 compares the two lower bounds with
the P-value and lists what the prior probability of the hypothesis would have
to be in order for the posterior probability to be as low as the P-value.
Notice how small the prior probability would have to be in order for there to
exist even a prior distribution on the alternative which would allow the
posterior probability to equal the P-value. For the normal distribution
priors and small P-values, the required $p_0$ is quite small.
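The entries of Table 4.152 can be checked numerically. The following is only a sketch, not part of the original text: the function names are illustrative, and it assumes that the posterior odds of the hypothesis equal the prior odds times the Bayes factor, so that the largest admissible $p_0$ is obtained by dividing the odds $p/(1-p)$ of the P-value by the lower bound.

```python
from math import exp
from statistics import NormalDist


def lower_bounds(p_value):
    """Return (global bound, normal-prior bound) on the Bayes factor.

    z plays the role of sqrt(n)|xbar - theta0| for a two-sided P-value.
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    global_bound = exp(-z * z / 2.0)
    # The normal-prior bound is 1 when z <= 1 (no evidence against H).
    normal_bound = 1.0 if z <= 1.0 else z * exp((1.0 - z * z) / 2.0)
    return global_bound, normal_bound


def largest_prior(p_value, bound):
    """Largest p0 for which the posterior probability can equal p_value."""
    # posterior odds = prior odds * Bayes factor, and Bayes factor >= bound
    prior_odds = (p_value / (1.0 - p_value)) / bound
    return prior_odds / (1.0 + prior_odds)


def table_row(p_value):
    """One row of the table: global bound, prior, normal bound, prior."""
    g, nrm = lower_bounds(p_value)
    return (g, largest_prior(p_value, g), nrm, largest_prior(p_value, nrm))


if __name__ == "__main__":
    for p in (0.1, 0.05, 0.01, 0.001, 0.0001):
        print(p, *(f"{v:.4f}" for v in table_row(p)))
```

Small differences in the last printed digit relative to the table are possible, since the table may have been computed from rounded normal quantiles.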

The discrepancy between the P-value and posterior probability of the
hypothesis, as described in Example 4.150, is sometimes called "Lindley's
paradox" [see Lindley (1957) and Jeffreys (1961)]. The contrast between
P-values and posterior probabilities is even more striking when one considers
more reasonable prior distributions on the alternative rather than the lower
bounds.

Example 4.151. Suppose that $X_1, \ldots, X_n$ are conditionally IID with
$N(\mu, \sigma^2)$ distribution given $(M, \Sigma) = (\mu, \sigma)$. Suppose
that we use a prior distribution that



Table 4.152. Comparison of P-Values and Lower Bounds in Example 4.150

               Global                Normal
P-value    Bound     Prior^a     Bound     Prior^a
0.1        0.2585    0.3006      0.7011    0.1368
0.05       0.1465    0.2643      0.4734    0.1001
0.01       0.0362    0.2179      0.1539    0.0616
0.001      0.0045    0.1835      0.0242    0.0398
0.0001     0.0005    0.1622      0.0033    0.0293

^a This is the largest possible value of $p_0$ which is consistent with the
posterior probability being equal to the P-value.






is conjugate as in Example 4.22 on page 224. The Bayes factor for the
hypothesis $H : M = \mu_0$ was given in (4.23). Consider what happens to this
expression as $n \to \infty$. First, suppose that the usual $t$ statistic
converges to a constant $t_0$. That is, assume that
$\sqrt{n}(\bar{x}_n - \mu_0)/s_n$ converges to $t_0$, where
$s_n^2 = w/(n - 1)$. In this case, the formula in (4.23) behaves
asymptotically (as $n \to \infty$) like $\sqrt{n}$ times a constant. This
means that the Bayes factor goes to $\infty$ as $n$ increases, hence the
posterior probability of the hypothesis goes to 1. What happens to the
P-value for this same sequence of data sets? Since the $t$ statistic is
converging to a constant $t_0$, the P-value is converging to
$2[1 - \Phi(|t_0|)]$. For example, if $t_0 = 1.96$, the P-value will go to
0.05, while the posterior probability of the hypothesis goes to 1. This is an
extreme example of Lindley's paradox. Once again, the suggestion of Lehmann
(1958) to let $\alpha$ decrease as $n$ increases would seem appropriate.

The situation is much the same if one uses the approximate Bayes factor as 
calculated in (4.29). This formula does not require that the prior be natural 
conjugate, but merely smooth in some sense. Since both 4* and 9 converge to 
finite values almost surely given 6 (by the strong law of large numbers 1.62), the 
expression in (4.29) also behaves asymptotically like y/n times a constant if the 
t statistic converges to to. 

Of course, the $t$ statistic will converge to a finite value with positive
probability only if the hypothesis is true. So, it is comforting that
virtually any smooth prior distribution will lead to eventually discovering
that the hypothesis is true, if it is indeed true.$^{44}$ On the other hand,
it is a bit disconcerting that the P-value will stay bounded away from 1 with
positive probability no matter how much data we observe. (See Problem 65 on
page 294 to see how to prove that the P-value has $U(0, 1)$ distribution
given that the hypothesis is true.) If the hypothesis is false, it is easy to
check that the P-value will go to 0, as will the Bayes factor.
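The $\sqrt{n}$ growth of the Bayes factor can be seen numerically in a simpler setting than Example 4.151. The sketch below is an illustration under stated assumptions, not formula (4.23): it takes the variance to be known and equal to 1, so that $\bar{X}_n \sim N(\theta, 1/n)$, and it takes the prior on the alternative to be $N(\theta_0, \tau^2)$ (the setting of Problem 6 with $n$ observations). The names `bayes_factor`, `posterior_prob`, and `tau2` are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def p_value(z):
    """Two-sided P-value when sqrt(n)(xbar - theta0) = z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bayes_factor(z, n, tau2=1.0):
    """Bayes factor in favor of H: Theta = theta0.

    Assumes xbar ~ N(theta, 1/n) and a N(theta0, tau2) prior under the
    alternative, so the marginal of xbar under A is N(theta0, tau2 + 1/n).
    """
    shrink = n * tau2 / (1.0 + n * tau2)
    return sqrt(1.0 + n * tau2) * exp(-0.5 * shrink * z * z)


def posterior_prob(z, n, p0=0.5, tau2=1.0):
    """Posterior probability of H from prior odds times Bayes factor."""
    odds = p0 / (1.0 - p0) * bayes_factor(z, n, tau2)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    z = 1.96  # the standardized statistic is held fixed as n grows
    for n in (1, 100, 10_000, 1_000_000):
        print(n, round(p_value(z), 4),
              round(bayes_factor(z, n), 3),
              round(posterior_prob(z, n), 4))
```

With the standardized statistic held at 1.96, the P-value stays near 0.05 for every $n$, while the Bayes factor in this sketch grows roughly like $\sqrt{n}$.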

The irreconcilability of P-values and posterior probabilities as illustrated
in Examples 4.150 and 4.151 is quite typical of cases in which $\Omega_H$ is a
lower-dimensional set than $\Omega$. [See Schervish (1996) for some examples
with distributions other than normal.] Together with the strange behavior
of P-values in Examples 4.148 and 4.149, this makes it difficult to justify
their use to measure strength of evidence in favor of the hypothesis for
two-sided problems.

4.7 Problems 

Section 4.1:

1. Prove that the loss function in (4.2) is equivalent to a 0-1-c loss function. 
(By "equivalent" we mean that both the posterior risk and the risk function 
will rank all decision rules the same way regardless of which loss is used.) 

2. Prove that the general form of the hypothesis-testing loss function (in
Definition 4.1) can be written as $d(\theta)$ times a 0-1 loss function for
some function $d > 0$ if the loss is 0 whenever a correct decision is made.



$^{44}$In Section 7.4, we will prove some results that make more precise this
limiting ability of posterior distributions to identify the value of a
parameter.






3. Let $X = (X_1, \ldots, X_n)$ be such that the $X_i$ are conditionally IID
with $N(0, \sigma^2)$ distribution given $\Sigma = \sigma$ under the
hypothesis $H$. Let $T(x) = \sqrt{n}\,\bar{x}/s$, where
$\bar{x} = \sum_{i=1}^n x_i/n$ and
$s^2 = \sum_{i=1}^n (x_i - \bar{x})^2/(n - 1)$. Define $x \preceq y$ if
$T(x) \le T(y)$. Find $p_H(x)$, showing that it is the same for all $\sigma$.

Section 4.2:



4. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. Suppose that
$L(\theta, 1) < L(\theta, 0)$ for all $\theta < \theta_0$ and that
$L(\theta, 1) > L(\theta, 0)$ for all $\theta > \theta_0$. Prove that, for
every prior there exists $k$ such that the formal Bayes rule will be to
choose action $a = 1$ if $X < k$.

5. Suppose that $X_1, \ldots, X_n$ are conditionally IID with
$N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. Use the
improper prior having Radon-Nikodym derivative $1/\sigma$ with respect to
Lebesgue measure on $(0, \infty) \times \mathbb{R}$. Let $\mu_0$ and $d$ be
known values, and suppose that the loss function is

$$L(\theta, a) = \begin{cases} c & \text{if } a = 1 \text{ and } |\mu - \mu_0| \le d\sigma, \\ 1 & \text{if } a = 0 \text{ and } |\mu - \mu_0| > d\sigma, \\ 0 & \text{otherwise.} \end{cases}$$

Prove that the formal Bayes rule will be of the following form: Choose
$a = 1$ if $|T| > k$ for some constant $k$, where
$T = \sqrt{n}(\bar{X}_n - \mu_0)/S_n$ and

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

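6. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Theta$ has
$\Pr(\Theta = \theta_0) = p_0$ and given $\Theta \neq \theta_0$,
$\Theta \sim N(\theta_0, \tau^2)$. Prove that the posterior density of
$\Theta$ with respect to the measure $\nu(A) = I_A(\theta_0) + \lambda(A)$,
where $\lambda$ is Lebesgue measure, is given by

$$f_{\Theta|X}(\theta|x) = \begin{cases} p_1 & \text{if } \theta = \theta_0, \\ (1 - p_1)\sqrt{\dfrac{1 + \tau^2}{2\pi\tau^2}} \exp\left[-\dfrac{1 + \tau^2}{2\tau^2}(\theta - \theta_1)^2\right] & \text{if } \theta \neq \theta_0, \end{cases}$$

where

$$\theta_1 = \frac{x\tau^2 + \theta_0}{1 + \tau^2}, \qquad \frac{p_1}{1 - p_1} = \frac{p_0}{1 - p_0}\sqrt{1 + \tau^2}\,\exp\left\{-\frac{1}{2}\left[\frac{\tau^2}{1 + \tau^2}\right](x - \theta_0)^2\right\}.$$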



7. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. Let
$H : \Theta = \theta_0$ and $A : \Theta \neq \theta_0$. Let the conditional
prior given $\Theta \neq \theta_0$ be $N(\theta_0, \tau^2)$.

(a) Prove that the Bayes factor is minimized if

$$\tau^2 = \begin{cases} (x - \theta_0)^2 - 1 & \text{if } |x - \theta_0| > 1, \\ 0 & \text{otherwise.} \end{cases}$$

(b) Show that the minimum Bayes factor is
$|x - \theta_0| \exp(\{-[x - \theta_0]^2 + 1\}/2)$ if $|x - \theta_0| > 1$,
and is 1 if $|x - \theta_0| \le 1$.

8. In Example 4.21 on page 224, prove that 



Section 4.3.1:

9. Let $(\alpha_0, \alpha_1)$ be a point on the lower boundary of the risk set
for a simple-simple hypothesis-testing problem. Prove that
$\alpha_0 + \alpha_1 \le 1$.

10. In a simple-simple hypothesis-testing problem, prove that the minimax
rule for a 0-1 loss function is any test that corresponds to the point where
$\partial_L$ intersects the line $y = x$.

11. Let $\Omega = \{0, 1\}$. Suppose that $P_0$ says that
$X \sim U(-\sqrt{3}, \sqrt{3})$ and $P_1$ says that $X \sim N(0, 1)$. Let
$\Omega_H = \{0\}$. Draw the risk set for a 0-1 loss function, and find the
minimax rule.

12. In a simple-simple hypothesis-testing problem with 0-1 loss, show that an
MP level $\alpha$ test has size $\alpha$ unless all tests with size $\alpha$
are inadmissible.

13. Return to the situation in Problem 28 on page 212. Consider the hypothesis
$H : X \sim f_0$ versus $A : X \sim f_1$. Find all $\alpha$ such that the MP
level $\alpha$ test is of the form "Reject $H$ if $-d < X < d$," and write $d$
as a function of $\alpha$.

14. *Prove Proposition 4.46 on page 235.

15. In Example 4.49 on page 236, prove that the Bayes rule with size $\alpha$
has higher power than the unconditional size $\alpha$ test.

Section 4.3.2: 

16. Suppose that the loss function is 0-1-c, that $\Omega_H = \{\theta_0\}$,
and that $\Omega = \{\theta_0\} \cup \Omega_A$. Prove that a UMP level
$\alpha$ test has no larger Bayes risk than any other size $\alpha$ test, no
matter what prior we use.

Section 4.3.3: 

17. Let $Y = |X|$ where
$f_{X|\Theta}(x|\theta) = 1/(\pi\theta[1 + (x/\theta)^2])$. Suppose that
$\Theta > 0$ for sure. Prove that the family of distributions for $Y$ has
MLR.

18. Let $X$ have Cauchy distribution $Cau(\theta, 1)$ given
$\Theta = \theta$.

(a) Prove that the MP level $\alpha$ test of $H : \Theta = \theta_0$ versus
$A : \Theta = \theta_1$ for $\theta_1 > \theta_0$ is essentially unique. That
is, if $\phi$ and $\psi$ are both MP level $\alpha$ tests, then
$P_\theta(\phi(X) = \psi(X)) = 1$ for all $\theta$.

(b) Prove that there is no UMP level $\alpha$ test of
$H : \Theta = \theta_0$ versus $A : \Theta > \theta_0$ for $0 < \alpha < 1$.

19. Let the parameter space be the open interval $\Omega = (0, 100)$. Let
$X_1$ and $X_2$ be conditionally independent given $\Theta = \theta$ with
$X_1 \sim Poi(\theta)$ and $X_2 \sim Poi(100 - \theta)$. We are interested in
the hypothesis $H : \Theta \le c$ versus $A : \Theta > c$.



k = lim 



Po 







(a) Show that there is no UMP level $\alpha$ test of $H$ versus $A$.

(b) Show that $T = X_1 + X_2$ is ancillary.

(c) Find the conditional UMP level $\alpha$ test given $T$.

(d) Find a prior distribution for $\Theta$ such that the conditional UMP
level $\alpha$ test given $T$ is to reject $H$ if
$\Pr(H \text{ is true} \mid X_1 = x_1, X_2 = x_2) < \alpha$.

20. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family, and let
$p(x, \theta) = (dP_\theta/dx)(x)$ be the density of a member of the family
with respect to Lebesgue measure. Assume that
$\partial^2 \log p(x, \theta)/\partial x\,\partial\theta$ exists for all $x$
and $\theta$. Prove that the family has increasing MLR if and only if
$\partial^2 \log p(x, \theta)/\partial x\,\partial\theta \ge 0$ for all $x$
and $\theta$.

21. Prove Proposition 4.55 on page 240. 

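22. Let $\Omega = \{\theta_1, \theta_2, \theta_3\}$ with
$\theta_1 < \theta_2 < \theta_3$. Suppose that given $\Theta = \theta$,
$X \sim N(\theta, 1)$. Let $H : \Theta \in \{\theta_1, \theta_2\}$ and
$A : \Theta = \theta_3$. Show that each test $\phi$ satisfying
$\beta_\phi(\theta_1) = \beta_\phi(\theta_2) = \alpha$ is inadmissible if
$0 < \alpha < 1$ and the loss function is 0-1.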

23. Suppose that the parametric family has MLR increasing and that
$H : \Theta \le \theta_0$ is the hypothesis of interest. Let the alternative
be $A : \Theta > \theta_0$. Suppose that the power function of every test is
continuous. Prove that the UMP level $\alpha$ test is the UMC floor $\alpha$
test.

24. Suppose that the parametric family has MLR increasing and that
$H : \Theta \le \theta_0$ is the hypothesis of interest. Let the alternative
be $A : \Theta > \theta_0$. Suppose that the UMP level $\alpha$ test has base
$\gamma$. Prove that the UMP level $\alpha$ test is the UMC floor $\gamma$
test.

25. Show that the family of $U(0, \theta)$ distributions for $\theta > 0$ does
not satisfy the conditions of Proposition 4.67. Find two UMP level $\alpha$
tests of $H : \Theta = 1$ versus $A : \Theta > 1$ which are not almost surely
equal.

26. Show that the family of $Poi(\theta)$ distributions for $\theta > 0$ does
not satisfy the conditions of Proposition 4.67. Nevertheless, prove that the
collection of one-sided tests of the form (4.57) for
$H : \Theta \le \theta_0$ versus $A : \Theta > \theta_0$ forms a minimal
complete class when the loss function is of hypothesis-testing type (see
Definition 4.1).

27. *Suppose that $\Omega = (0, \infty)$ and that $X_1, \ldots, X_n$ are IID
$U(0, \theta)$ given $\Theta = \theta$. For $0 < \alpha < 1$, find a UMP level
$\alpha$ test for each of the hypothesis-alternative pairs below:

(a) $H : \Theta \le \theta_0$ versus $A : \Theta > \theta_0$.

(b) $H : \Theta \ge \theta_0$ versus $A : \Theta < \theta_0$.

(c) $H : \Theta = \theta_0$ versus $A : \Theta \neq \theta_0$.

(d) In part (a), find a second UMP level $\alpha$ test that differs from the
one you found for part (a) with positive probability given $\Theta = \theta$
for $\theta$ in a set of positive Lebesgue measure.

28. (a) Suppose that X - U(0,6) given 0 = 0. Find the conditional distri- 

bution of Y = X~ a given 0 = 0. 
(b) Let a > 0 be fixed, and suppose that X - Par(a,6) given 0 = 0. 
Find UMP level a tests for each of the three hypothesis-alternative 
pairs in Problem 27 above. 






29. The density of the $NCB(\alpha, \beta, \psi)$ distribution is given in
Appendix D.

(a) Prove that, for fixed $\alpha$ and $\beta$, this family of distributions
has increasing MLR in $x$ where $\psi$ is the parameter. (Feel free to pass
derivatives under summations.)

(b) Use the result of the previous part to show that the noncentral $F$
distribution has increasing MLR also.

(c) Also show that the noncentral $t$ distribution has increasing MLR.

30. Prove that if $\phi$ is UMP level $\alpha$ for testing
$H : \Theta \in \Omega_H$ versus $A : \Theta \in \Omega_A$, then $1 - \phi$ is
UMC floor $1 - \alpha$ for testing $H' : \Theta \in \Omega_A$ versus
$A' : \Theta \in \Omega_H$.

31. *Let $X_1, \ldots, X_n$ be IID $U(\theta - 1/2, \theta + 1/2)$ given
$\Theta = \theta$. Let $H : \Theta \ge \theta_0$ and
$A : \Theta < \theta_0$.

(a) Find the UMP level $\alpha$ test of $H$ versus $A$.

(b) Find the UMC floor $\alpha$ test of $H$ versus $A$.

(c) Suppose that we begin with a uniform improper prior for $\Theta$. Let the
loss be 0-1-c with $1/(1 + c) = \alpha$. Find the formal Bayes rule.

(d) Calculate the power functions of the three tests above. Compare them
and explain the differences intuitively.

(e) Find the UMP level $\alpha$ test of $H$ versus $A$ conditional on the
ancillary $U = \max X_i - \min X_i$.

(f) Find the UMC floor $\alpha$ test of $H$ versus $A$ conditional on the
ancillary $U = \max X_i - \min X_i$, and show that this is the same as the
UMP level $\alpha$ test conditional on the ancillary.

32. Let $\Omega = (-\infty, \theta_0]$, and let $X_1, \ldots, X_n$ be IID
$U(\theta - 1/2, \theta + 1/2)$ given $\Theta = \theta$. Let
$H : \Theta = \theta_0$ and $A : \Theta < \theta_0$. Suppose that we have a
prior distribution such that $\Pr(\Theta = \theta_0) = p_0 > 0$ and that the
conditional distribution of $\Theta$ given $\Theta < \theta_0$ has strictly
positive density $g(\theta)$ for $\theta < \theta_0$. Assume that the loss is
0-1-c. Find the formal Bayes rule for data values such that
$\max\{x_1, \ldots, x_n\} < \theta_0 + 1/2$. (This condition assures that the
data are consistent with the parameter space.) Does this test match any of
the tests in Problem 31 above?

33. Suppose that $\mu_\Theta$ is a probability measure on $(\Omega, \tau)$.
Suppose that the conditions of Theorem 4.68 are satisfied and that there is a
test with finite Bayes risk with respect to $\mu_\Theta$. Suppose that the
loss function is bounded below. Show that there is a one-sided test that is a
formal Bayes rule.

34. Let $\Omega \subseteq \mathbb{R}$ and $H : \Theta \le \theta_0$. If power
functions are continuously differentiable and $\phi$ is the unique level
$\alpha$ test with maximum derivative for the power function at $\theta_0$,
show that $\phi$ is LMP level $\alpha$ relative to
$d(\theta) = \theta - \theta_0$ for $\theta > \theta_0$.

35. Let $X_1, \ldots, X_n$ be conditionally IID with $Cau(\theta, 1)$
distribution given $\Theta = \theta$.

(a) Find the LMP level $\alpha$ test of $H : \Theta \le \theta_0$ versus
$A : \Theta > \theta_0$ for $0 < \alpha < 1$. (Hint: Pass derivatives under
integral signs.)






(b) Prove that the power function of this test goes to 0 as
$\theta \to \infty$.

36. Let $\Omega = (0, \infty)$, and let



(a) Show that this family of distributions has MLR.

(b) Show that the conditions of Proposition 4.67 are not met.

(c) Show that the UMP level $\alpha$ test of $H : \Theta = \theta_0$ versus
$A : \Theta > \theta_0$ is unique if $\alpha < 1/2$.

(d) Show that the UMP level $\alpha$ test of $H : \Theta = \theta_0$ versus
$A : \Theta < \theta_0$ is unique if $\alpha < 1/2$.

(e) If $\alpha > 1/2$, find two UMP level $\alpha$ tests of
$H : \Theta = \theta_0$ versus $A : \Theta > \theta_0$ that are not almost
surely equal.

Section 4.3.4:

37. Suppose that the conditions of Theorem 4.82 are met. Let the loss function
be of the hypothesis-testing type for a two-sided hypothesis. Show that the
class of tests of the form given in Theorem 4.82 is essentially complete.

38. Suppose that $\mu_\Theta$ is a probability measure on $(\Omega, \tau)$.
Let the loss function be of the hypothesis-testing type for a two-sided
hypothesis. Suppose that the conditions of Theorem 4.82 are satisfied and that
there is a test with finite Bayes risk with respect to $\mu_\Theta$. Suppose
that the loss function is bounded below. Show that there is a test of the form
given in Theorem 4.82 which is a formal Bayes rule.

39. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. Let
$\Omega_H$ be the set of rational numbers, and let $\Omega_A$ be the set of
irrational numbers. Prove that the UMP level $\alpha$ test of
$H : \Theta \in \Omega_H$ versus $A : \Theta \in \Omega_A$ is the trivial test
$\phi(x) \equiv \alpha$.

40. *Let $X$ be a random variable, and define



Let $P_1$ and $P_2$ be two different possible distributions of $X$ with
corresponding expectations $E_1$ and $E_2$ such that $P_1 \ll P_2$ and
$P_2 \ll P_1$. Suppose that $E_1\phi_w(X) = w$ for all $w$. Prove that
$g(w) = E_2\phi_w(X)$ is continuous. (Hint: To prove that $g$ is continuous at
$w$, consider three cases. First, suppose that $1 > \gamma_w > 0$, and prove
that so long as $z$ is close enough to $w$ so that $1 > \gamma_z > 0$ and
$c_z = c_w$, then $g(z)$ is close to $g(w)$. Second, suppose that
$\gamma_w = 0$. When $w$ increases, either $\gamma_w$ or $c_w$ must increase
(or both). Either way, the increase can be made small enough so that $g(w)$
does not change much. This is similar if $w$ decreases. The third case,
$\gamma_w = 1$, is similar to the second case.)

41. Let $X \sim Exp(\theta)$ given $\Theta = \theta$. Consider the two
hypotheses $H_1 : \Theta \le 1$ versus $A_1 : \Theta > 1$ and
$H_2 : \Theta \notin (1, 2)$ versus $A_2 : \Theta \in (1, 2)$.

(a) Find the UMP level 0.05 tests of the two hypotheses.

(b) Find the set of all $x$ values such that the UMP level 0.05 test of $H_2$
rejects $H_2$ but the UMP level 0.05 test of $H_1$ accepts $H_1$.

42. Let $\Omega_1 \subset \Omega_2$ be strictly nested subsets of the
parameter space. Let $H_i : \Theta \in \Omega_i$ and
$A_i : \Theta \notin \Omega_i$ for $i = 1, 2$. Suppose that $L_i$ is 0-1-c
loss for testing $H_i$ versus $A_i$ for $i = 1, 2$ (with the same $c$ for both
cases). Consider the problem of simultaneously testing both hypotheses with
action space $\{0, 1\}^2$, the first coordinate being the action for the first
hypothesis, and the second coordinate being the action for the second
hypothesis. (That is, for example, $a = (a_1, a_2) = (0, 1)$ means to reject
$H_2$ but accept $H_1$.) Suppose that the loss function for the simultaneous
tests is $L(\theta, a) = L_1(\theta, a_1) + L_2(\theta, a_2)$. A pair of tests
$(\phi_1, \phi_2)$ can be thought of as a randomized decision rule in this
problem. In this case, $\phi_i(x) = \Pr(\text{reject } H_i \mid X = x)$. We
can say that a pair of tests $(\phi_1, \phi_2)$ is incoherent if there exists
$\theta \in \Omega_2$ such that

$$P_\theta(\{x : \phi_1(x) < \phi_2(x)\}) > 0.$$

(a) Prove that an incoherent pair of tests is inadmissible. (Hint: Switch
the two tests for all $x$ such that $\phi_1(x) < \phi_2(x)$.)

(b) Consider the special case in which $X \sim N(\theta, 1)$ given
$\Theta = \theta$, $\Omega_1 = \{\theta_0\}$, and
$\Omega_2 = (-\infty, \theta_0]$. Let $\phi_1$ and $\phi_2$ be the UMPU level
$\alpha$ tests of their respective hypotheses. Find a pair of tests that
dominates $(\phi_1, \phi_2)$ in the decision problem defined above.

Section 4.4:

43. For each $k = 1, 2, \ldots$, let $\gamma_{k,\alpha}$ denote the
$1 - \alpha$ quantile of the $\chi^2_k$ distribution. Let $Y_k$ have
$NC\chi^2_k(c^2)$ distribution. Prove that
$\Pr(Y_k > \gamma_{k,\alpha}) \le \Pr(Y_1 > \gamma_{1,\alpha})$ for all
$k \ge 1$. (Hint: Let $X_1, \ldots, X_k$ be IID $N(\theta, 1)$ given
$\Theta = \theta$, and consider two tests of $H : \Theta = 0$ based on
$\sum_{i=1}^k X_i^2$ and $(\sum_{i=1}^k X_i)^2$.)

44. Prove Proposition 4.92 on page 254. 

45. Prove Proposition 4.93 on page 254. 

46. Let $X_1, \ldots, X_n$ be IID $U(\theta - 1/2, \theta + 1/2)$ given
$\Theta = \theta$. Let $H : \Theta = \theta_0$ and
$A : \Theta \neq \theta_0$.

(a) Let $\phi$ be a test with size $\alpha$. Prove that $\phi$ is unbiased
level $\alpha$ if $\phi(x) = 1$ for all $x$ which satisfy
$f_{X|\Theta}(x|\theta_0) = 0$. (Hint: Use the fact that if
$f_{X|\Theta}(x|\theta_0) f_{X|\Theta}(x|\theta_1) > 0$ for all $x \in B$,
then $P_{\theta_0}(B) = P_{\theta_1}(B)$.)

(b) Prove that there does not exist a UMPU level $\alpha$ test for
$\alpha > 0$. (Hint: You can slightly modify the UMP and UMC tests for the
one-sided case in Problem 31 on page 289 to produce unbiased level $\alpha$
tests with maximum power for $\theta < \theta_0$ and $\theta > \theta_0$,
respectively. Then prove that it is impossible for a single test to achieve
both maxima.)

47. Prove Proposition 4.97 on page 255. 






48. Suppose that the conditions of Lemma 4.99 hold and that the loss function
is of the hypothesis-testing type. Prove that the class of two-sided tests is
essentially complete.

49. In a one-parameter exponential family with natural parameter $\theta$,
prove that the condition

$$\left.\frac{d}{d\theta}\beta_\phi(\theta)\right|_{\theta = \theta_0} = 0$$

can be written as (4.117) if $\beta_\phi(\theta_0) = \alpha$.

50. Let $\Omega \subseteq \mathbb{R}$ and $H : \Theta = \theta_0$ versus
$A : \Theta \neq \theta_0$. If power functions are twice continuously
differentiable and $\phi$ is the unique unbiased level $\alpha$ test with
maximum second derivative for the power function at $\theta_0$, show that
$\phi$ is LMPU level $\alpha$ relative to $d(\theta) = |\theta - \theta_0|$
for $\theta \neq \theta_0$.

Section 4.5:



51. *Let the joint density of (Xi,X 2 ) given (0i,0 2 ) = (0i,02) be 

/Xi ) x 2 |ei,e 2 ( :E i J ^2pi,c/2j = _ q 2 ^ J(-°°>*iA x 2j, 

where $\phi$ and $\Phi$ are respectively the standard normal density and CDF.
Find the UMPU level $\alpha$ test of $H : \Theta_2 \le c$ versus
$A : \Theta_2 > c$.

52. Let the parameter space be $\Omega = (0, 1) \times \mathbb{R}$.
Conditional on $(P, \Theta) = (p, \theta)$, $N \sim Bin(10, p)$, and
conditional on $N = n$, $X_1, \ldots, X_{n+1}$ are IID $N(\theta, 1)$. You get
to observe $N, X_1, \ldots, X_{N+1}$. We are interested in the hypothesis
$H : \Theta \le 0$ versus $A : \Theta > 0$.

(a) Find the UMPU level $\alpha$ test of $H$ versus $A$.

(b) Suppose that $P = p$ is actually known. Show that there is no UMPU
level $\alpha$ test of $H$ versus $A$.

53. In the framework of Example 4.121 on page 266, let $H : M \le \mu_0$ and
$A : M > \mu_0$. Prove that the usual size $\alpha$ one-sided $t$-test is UMPU
level $\alpha$.

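54. *Suppose that $X_1, \ldots, X_n$ are conditionally IID $N(\mu, \sigma^2)$
given $\Theta = (\mu, \sigma)$. Define

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i, \quad s_n^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x}_n)^2, \quad t_1 = \sqrt{n}\,\frac{\bar{x}_n - a}{s_n}, \quad t_2 = \sqrt{n}\,\frac{\bar{x}_n - b}{s_n}.$$

Let $\alpha_1, \alpha_2 \ge 0$ and $\alpha = \alpha_1 + \alpha_2 < 1$. Define

$$\phi(x) = \begin{cases} 1 & \text{if } t_1 < T_{n-1}^{-1}(\alpha_1) \text{ or if } t_2 > T_{n-1}^{-1}(1 - \alpha_2), \\ 0 & \text{otherwise.} \end{cases}$$

(a) Prove that $\phi$ has size $\alpha$ as a test of $H : a \le M \le b$
versus $A$ : not $H$. (Hint: For each $\mu \in [a, b]$ find the distributions
of $t_1$ and $t_2$ given $\Theta = (\mu, \sigma)$ and see what happens as
$\sigma \to \infty$.)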






(b) Prove that $\phi$ is not an unbiased level $\alpha$ test if both
$\alpha_1$ and $\alpha_2$ are strictly positive. (Hint: Show that the limit of
the power function at $(a, \sigma)$ is $\alpha_1$ as $\sigma \to 0$. Since the
power function is continuous, what does this say about the power function at
$(a + \epsilon, \sigma)$ for $\sigma$ and $\epsilon$ small?)

55. Suppose that $X_1 \sim Bin(n_1, p_1)$ independent of
$X_2 \sim Bin(n_2, p_2)$ conditional on $(P_1, P_2) = (p_1, p_2)$. Find the
UMPU level $\alpha$ test of $H : P_1 = P_2$ versus $A : P_1 \neq P_2$.

56. *Let $X_1, \ldots, X_n$ be IID given $\Theta = (\theta_1, \theta_2)$ with
density

$$f_{X|\Theta}(x|\theta_1, \theta_2) = \exp\{-\psi(\theta_1) + \theta_1\cos(x - \theta_2)\}\, I_{[0, 2\pi)}(x),$$

where

$$\psi(y) = \log \int_0^{2\pi} \exp(y\cos(t))\,dt.$$

Let $\Omega = \{(\theta_1, \theta_2) : \theta_1 > 0, 0 \le \theta_2 < 2\pi\}$.

(a) Let $H_1 = \Theta_1\cos\Theta_2$ and $H_2 = \Theta_1\sin\Theta_2$. Let
$H : \Theta_2 = 0$ versus $A : \Theta_2 \neq 0$ and let $H^* : H_2 = 0$ versus
$A^* : H_2 \neq 0$. Prove that the UMPU level $\alpha$ test of $H^*$ is also
UMPU level $\alpha$ for $H$. (You need not actually find the test to prove
this.) (Hint: Use the fact that if $f$ and $g$ are analytic functions of one
variable and $f(x) = g(x)$ for all $x$ on some smooth curve, then $f = g$.)

(b) Find the form of the UMPU level $\alpha$ test of $H$ as closely as you
can. (You will not be able to find the cutoffs in closed form.)

(c) Why is this a reasonable test of $H^*$ but not of $H$?

57. *Let $X$ and $Y$ have joint density given
$(M, \Lambda) = (\mu, \lambda)$,

$$f_{X,Y|M,\Lambda}(x, y|\mu, \lambda) = \mu\lambda\exp(-\mu x - \lambda y)\, I_{[0,\infty)}(x)\, I_{[0,\infty)}(y).$$

Find UMPU level $\alpha = 0.2$ tests for the following hypotheses:

(a) $H : \Lambda \le M + 1$ versus $A : \Lambda > M + 1$.

(b) $H : \Lambda = M$ versus $A : \Lambda \neq M$.

58. *Consider a breeding experiment in which each observation is a
classification into one of three groups. Suppose that the observations are
conditionally independent given $(P_1, P_2, P_3) = (p_1, p_2, p_3)$ with
conditional probability $p_i$ of being classified as group $i$.

(a) Find the form of the UMPU level $\alpha$ test of $H : P_2 = 3P_1$ versus
$A : P_2 \neq 3P_1$ based on $n$ observations, and say how you would determine
the exact rejection region.

(b) For the case $n = 2$, find the precise form of the UMPU level 0.1 test.

(c) Suppose that $(P_1, P_2, P_3)$ has a prior distribution of the Dirichlet
form with density

for all $p_i > 0$, $p_1 + p_2 + p_3 = 1$, where all $\alpha_i > 1$. Find the
posterior mean of $P_2/P_1$.






59. We will observe $Y_1, \ldots, Y_k$ where the conditional distribution of
$Y_i$ given $(B_0, B_1) = (\beta_0, \beta_1)$ is $Bin(n_i, p_i)$, where
$p_i = [1 + \exp(\beta_0 + \beta_1 x_i)]^{-1}$, the $Y_i$ are conditionally
independent, and the $n_i$ and $x_i$ are all known.

(a) Find minimal sufficient statistics.

(b) Find the form of the UMPU level $\alpha$ test of $H : B_1 \le c$ versus
$A : B_1 > c$.

(c) For the special case $k = 2$, $n_1 = 3$, $n_2 = 2$, $x_1 = 1$, $x_2 = 2$,
$c = 0$, $\alpha = 0.1$, find the exact test for all possible data values.

60. Consider the proof in Section 4.5.6 that the $F$-test is a proper Bayes
rule.

(a) Prove that the prior distribution given in Section 4.5.6 puts positive
probability on every open subset of the original parameter space for
$\Theta$.

(b) Prove that the Bayes rule is admissible.

Section 4.6:

61. Suppose that $X_i$ are IID $N(\mu, 1)$ given $M = \mu$ for
$i = 1, \ldots, n$ and that we use Lebesgue measure as an improper prior for
$M$. If $H : M \le c$, show that the posterior probability that $H$ is true
equals the P-value associated with the family of one-sided tests.

62. Suppose that $X_i$ are IID $N(\mu, \sigma^2)$ given
$(M, \Sigma) = (\mu, \sigma)$ for $i = 1, \ldots, n$ and that we use the usual
improper prior with Radon-Nikodym derivative $1/\sigma$. If $H : M \le c$,
show that the posterior probability that $H$ is true equals the P-value
associated with the family of one-sided $t$-tests.

63. Prove Proposition 4.144 on page 280.

64. In Example 4.146 on page 281, suppose that we change $H$ to
$H : P \ge p_0$. Prove that the P-value, associated with the family of UMP
level $\alpha$ tests, of an observed $x < n$ is the posterior probability that
the hypothesis is true based on an improper prior $Beta(1, 0)$.

65. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family. For each
$\alpha$, let $\phi_\alpha(x) = I_{S_\alpha}(x)$ be a size $\alpha$ test of
$H : \Theta = \theta_0$ versus some alternative such that, for
$0 \le \alpha < 1$, $S_\alpha = \bigcap_{\beta > \alpha} S_\beta$. Suppose
that $S_0 = \emptyset$ and $S_1 = \mathcal{X}$.

(a) Prove that if $\alpha < \beta$, then $S_\alpha \subseteq S_\beta$.

(b) Let $p_H(x)$ be the P-value of an observed $x$. Show that, given
$\Theta = \theta_0$, $p_H(X) \sim U(0, 1)$.

(c) Suppose that $\phi_\alpha$ is unbiased for each $\alpha$. Prove that

$$P_\theta(p_H(X) \le \alpha) \ge P_{\theta_0}(p_H(X) \le \alpha).$$

66. *Let $X \sim N(\theta, 1)$ given $\Theta = \theta$. Let $g(\theta, x)$ be
the P-value for testing $H_\theta : \Theta = \theta$ versus
$A_\theta : \Theta \neq \theta$ for data $X = x$.

(a) If the hypothesis is $H : \Theta \in [a, b]$ versus
$A : \Theta \notin [a, b]$, and $X = x > (a + b)/2$ is observed, prove that
the P-value is $\Phi(a - x) + \Phi(b - x)$, where $\Phi$ is the standard
normal CDF.






(b) Assume that Lebesgue measure is used as an improper prior for $\Theta$
and that data $X = x > b$ are observed. Prove that for all hypotheses of the
form $\Theta \le b$, $\Theta = b$, and $\Theta \in [a, b]$, the P-value equals
$E(g(\Theta, x) \mid H \text{ is true}, X = x)$.

67. Let $X \sim N(\theta, 1/n)$ given $\Theta = \theta$, where $n$ is known.
Let the prior for $\Theta$ be a mixture of a point mass at 0 and an $N(0, 1)$
distribution. Consider the hypothesis $H : \Theta = 0$ versus
$A : \Theta \neq 0$. Draw a graph of the Bayes factor as a function of $x$ and
a graph of the P-value as a function of $x$ for $n = 1, 10, 100$.

68. Return to Problem 31 on page 289.

(a) Find the three P-values relative to the three sets of tests in parts (a),
(b), and (c).

(b) Find the posterior probability that $H$ is true in part (c). Does it equal
any of the three P-values?

69. Let $X$ have $N(\mu, 1)$ distribution given $M = \mu$ and let Lebesgue
measure be an improper prior for $M$. For hypotheses of the form $M \le c$,
$M \ge c$, or $M = c$, we will show that the P-value for the usual family of
tests is the posterior probability that $M$ is farther from $X$ (in the
direction of the alternative) than $X$ is from $c$.

(a) Let $H : M \le c$ versus $A : M > c$. Prove that the P-value equals the
posterior probability that $M - X \ge X - c$. State and prove a similar
result for $H : M \ge c$.

(b) Let $H : M = c$ versus $A : M \neq c$. Prove that the P-value equals the
posterior probability that $|M - X| \ge |X - c|$.

(c) Can you extend this interpretation to the case of $H : a \le M \le b$
versus $A : M \notin [a, b]$?



Chapter 5 
Estimation 



In Chapter 3, we discussed methods for choosing decision rules in problems
with specified loss functions. In Section 3.3, we gave an axiomatic derivation
of some of those methods. This derivation led to the conclusion that there
is a probability and a loss function, and one should minimize the expected
loss. There are decision problems in which $\aleph$ and $\Omega$ are the same
(or nearly the same) space and the loss function $L(\theta, a)$ is an
increasing function of some measure of distance between $\theta$ and $a$. Such
problems are often called point estimation problems. The classical framework
makes no use of the probability over the parameter space provided by the
axiomatic derivation. One can also try to ignore the loss function as well. To
estimate $\Theta$ without a specific loss function, one can adopt ad hoc
criteria to decide if an estimator is good. In this chapter, we will study
some of these criteria as well as some criteria for the problem of set
estimation. In set estimation, the action space is a collection of subsets of
the parameter space (or the closure of the parameter space). The idea is to
find a set that is likely to contain the parameter without being "too big" in
some sense.



5.1 Point Estimation 

A point estimator of a function $g$ of a parameter $\Theta$ is a statistic
that takes its values in the same set (or at least a similar set) as does
$g(\Theta)$. One popular type of point estimator is an unbiased estimator.

Definition 5.1. Let $\Omega$ be the parameter space for a parametric family
with $P_\theta$ and $E_\theta$ specifying the conditional distribution of $X$
given $\Theta = \theta$. Let $g : \Omega \to G$ be some measurable function.
Let $G' \supseteq G$. A measurable function $\phi : \mathcal{X} \to G'$ is
called an estimator of $g(\Theta)$. An estimator $\phi$ of $g(\Theta)$ is
called unbiased if $E_\theta(\phi(X)) = g(\theta)$, for all
$\theta \in \Omega$. The bias of $\phi$ is defined as

$$b_\phi(\theta) = E_\theta(\phi(X)) - g(\theta).$$

The next example is one of several that led some early researchers to
believe that unbiased estimators may not be bad to use.

Example 5.2. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID
$N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma)$. If
$X = (X_1, \ldots, X_n)$, then we define $\bar{X} = \sum_{i=1}^n X_i/n$. It is
easy to see that $E_\theta(\bar{X}) = \mu$. So, if $g(\theta) = \mu$, we see
that $\phi(X) = \bar{X}$ is unbiased.

The following example shows that restricting attention to unbiased esti- 
mators may lead to an impasse. 

Example 5.3. Suppose that $P_\theta$ says that $X \sim Exp(\theta)$. If
$\phi(X)$ is to be an unbiased estimator of $\Theta$, then

$$E_\theta\phi(X) = \int_0^\infty \phi(x)\,\theta\exp(-\theta x)\,dx = \theta,$$

for all $\theta$. This happens if and only if
$\int_0^\infty \phi(x)\exp(-\theta x)\,dx = 1$, for all $\theta$. By
Theorem 2.64, we can differentiate the left-hand side with respect to $\theta$
under the integral and get $\int_0^\infty x\phi(x)\exp(-\theta x)\,dx = 0$ for
all $\theta$. This means that $E_\theta(X\phi(X)) = 0$ for all $\theta$. Since
$X$ is a complete sufficient statistic, $\phi(x) = 0$, a.s. $[P_\theta]$ for
all $\theta$. This contradicts $\phi(X)$ being unbiased. Hence, there are no
unbiased estimators of $\Theta$.



5.1.1 Minimum Variance Unbiased Estimation 

It is natural to check how unbiased estimators fare under certain loss
functions. The most common one to use is squared-error loss
$L(\theta, a) = (g(\theta) - a)^2$. The risk function of an estimator $\phi$
is

$$R(\theta, \phi) = E_\theta\{(g(\theta) - \phi(X))^2\} = b_\phi^2(\theta) + \mathrm{Var}_\theta\,\phi(X).$$

If an estimator is unbiased, the risk function is just the variance. This
suggests the following "optimality" criterion for unbiased estimators.

Definition 5.4. An unbiased estimator $\phi$ is a uniformly minimum variance
unbiased estimator (UMVUE) if $\phi$ has finite variance and, for every
unbiased estimator $\psi$,
$\mathrm{Var}_\theta\,\phi(X) \le \mathrm{Var}_\theta\,\psi(X)$ for all
$\theta \in \Omega$.

UMVUEs are not necessarily good, as we will see later. The criterion of
unbiasedness only means that the average of $\phi(X)$ with respect to
$P_\theta$ is $g(\theta)$ for all $\theta$. It does not mean that you expect
$\phi(X)$ to be near $g(\theta)$ nor does it mean that you expect $g(\Theta)$
to be near $\phi(x)$ after you have seen $X = x$.

We mentioned earlier that the concept of complete sufficient statistic
would play a role in unbiased estimation. The following theorem is due to
Lehmann and Scheffé (1955).

Theorem 5.5 (Lehmann-Scheffé theorem). If $T$ is a complete statistic, then
all unbiased estimators of $g(\Theta)$, that are functions of $T$ alone, are
equal, a.s. $[P_\theta]$ for all $\theta$. If there exists an unbiased
estimator that is a function of a complete sufficient statistic, then it is a
UMVUE.

Proof. Suppose that $\phi_1(T)$ and $\phi_2(T)$ are unbiased estimators of
$g(\Theta)$. Then $E_\theta[\phi_1(T) - \phi_2(T)] = 0$ for all $\theta$.
Since $T$ is a complete statistic, it follows that $\phi_1(T) = \phi_2(T)$,
a.s. $[P_\theta]$. Now, suppose that there is an unbiased estimator $\phi(X)$
with finite variance. Define $\phi_3(T) = E(\phi(X)|T)$. Then $\phi_3(T)$ is
unbiased by the law of total probability B.70. Using squared-error loss, the
Rao-Blackwell theorem 3.22 says $R(\theta, \phi_3) \le R(\theta, \phi)$ for
all $\theta$. Since the risk function is the variance for unbiased estimators,
this makes $\phi_3$ a UMVUE. □

Example 5.6. Suppose that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu, \sigma^2)$ given $\Theta = (M, \Sigma) =
(\mu, \sigma)$. Let $X = (X_1, \ldots, X_n)$. Then

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{and}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$$

are complete sufficient statistics. Since they are unbiased, they are UMVUE of $M$
and $\Sigma^2$, respectively. Notice that $S^2$ does not minimize mean squared error, even
among estimators of the form $c\sum_{i=1}^n (X_i - \bar{X})^2$. (See Example 3.25 on page 154
and Problem 11 on page 210.)
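
The remark about mean squared error can be checked with a short calculation. The sketch below is only illustrative: it assumes the standard fact that $\sum_i (X_i - \bar{X})^2/\sigma^2$ has a chi-square distribution with $n - 1$ degrees of freedom, and the function name and the choice $n = 10$ are hypothetical.

```python
def mse_scaled_sum_of_squares(c, n, sigma2=1.0):
    """Exact MSE of c * W as an estimator of sigma^2, where W = sum (X_i - Xbar)^2.

    Assumes W / sigma^2 ~ chi-square(n - 1), so that
    E[W] = (n - 1) * sigma^2 and Var[W] = 2 * (n - 1) * sigma^4.
    """
    bias = (c * (n - 1) - 1.0) * sigma2
    variance = c * c * 2.0 * (n - 1) * sigma2 ** 2
    return bias ** 2 + variance


n = 10  # hypothetical sample size
mse_unbiased = mse_scaled_sum_of_squares(1.0 / (n - 1), n)  # S^2, the UMVUE
mse_divide_n = mse_scaled_sum_of_squares(1.0 / n, n)        # divisor n
mse_divide_n_plus_1 = mse_scaled_sum_of_squares(1.0 / (n + 1), n)  # divisor n + 1

print(mse_unbiased, mse_divide_n, mse_divide_n_plus_1)
```

Under these assumptions the divisor $n + 1$ gives the smallest mean squared error of the three, even though only the divisor $n - 1$ gives an unbiased estimator.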

Example 5.7. Suppose that $P_\theta$ says that $X$ has $Poi(\theta)$ distribution. We know
that $X$ is a complete sufficient statistic. Let $g(\Theta) = \exp(-3\Theta)$. We will find the
UMVUE of $g(\Theta)$. The required condition is

$$E_\theta\,\phi(X) = \sum_{x=0}^\infty \phi(x)\exp(-\theta)\frac{\theta^x}{x!} = \exp(-3\theta),$$

for all $\theta$. It follows from the uniqueness of Taylor series expansions for analytic
functions that $\phi$ is unbiased if and only if $\phi(x) = (-2)^x$ for $x = 0, 1, 2, \ldots$. This $\phi$,
although UMVUE, is an abominable estimator of $g(\Theta)$. We will see some better
estimators later in this chapter. (See Examples 5.29 and 5.32.)
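
To see numerically why $(-2)^x$ is unbiased yet a poor estimator, the following sketch sums the Poisson series directly. The helper name, the truncation point, and the value $\theta = 1.5$ are assumptions made for illustration only.

```python
import math


def poisson_expectation(phi, theta, terms=200):
    """Approximate E_theta[phi(X)] for X ~ Poisson(theta) by truncating the series."""
    total = 0.0
    pmf = math.exp(-theta)  # P(X = 0)
    for x in range(terms):
        total += phi(x) * pmf
        pmf *= theta / (x + 1)  # P(X = x + 1) obtained from P(X = x)
    return total


def umvue(x):
    """The unbiased estimator (-2)^x of exp(-3 * theta)."""
    return (-2.0) ** x


theta = 1.5  # hypothetical parameter value
target = math.exp(-3.0 * theta)
mean_umvue = poisson_expectation(umvue, theta)
mse_umvue = poisson_expectation(lambda x: (umvue(x) - target) ** 2, theta)

print(target, mean_umvue, mse_umvue)
```

The mean matches $\exp(-3\theta)$, but the estimator takes values such as $-2$, $4$, and $-8$ for a quantity that lies in $(0, 1)$, so its mean squared error is enormous relative to the target.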

The following results are useful when there is no complete sufficient 
statistic. 

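Proposition 5.8. Let $\delta_0$ be an unbiased estimator of $g(\Theta)$, and let

$$\mathcal{U} = \{U : E_\theta U(X) = 0, \text{ for all } \theta\}.$$

Then, the set of all unbiased estimators of $g(\Theta)$ is $\{\delta_0 + U : U \in \mathcal{U}\}$.

Theorem 5.9. An estimator $\delta$ is UMVUE of $E_\theta\,\delta(X)$ if and only if, for
every $U \in \mathcal{U}$, $\mathrm{Cov}_\theta(\delta(X), U(X)) = 0$.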




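Proof. For the "only if" part, suppose that $\delta$ is UMVUE. It is clear that
if $\mathrm{Var}_\theta U(X) = 0$ for all $\theta$, then $\mathrm{Cov}_\theta(\delta(X), U(X)) = 0$. So, let $U \in \mathcal{U}$ be
such that $\mathrm{Var}_\theta U(X) > 0$ for some $\theta$. Let $\lambda \in \mathbb{R}$, and define $\delta_\lambda = \delta + \lambda U$.
Then $\delta_\lambda$ is unbiased also, and for every $\lambda$,

$$\mathrm{Var}_\theta\,\delta(X) \le \mathrm{Var}_\theta\,\delta_\lambda(X) = \mathrm{Var}_\theta\,\delta(X) + 2\lambda\,\mathrm{Cov}_\theta(\delta(X), U(X)) + \lambda^2\,\mathrm{Var}_\theta U(X).$$

This is true for all $\lambda$ and all $\theta$ if and only if

$$\lambda^2\,\mathrm{Var}_\theta U(X) \ge -2\lambda\,\mathrm{Cov}_\theta(\delta(X), U(X)),$$

which, in turn, is true for all $\lambda$ and $\theta$ if and only if $\mathrm{Cov}_\theta(\delta(X), U(X)) = 0$
for all $\theta$.

For the "if" part, assume that for all $U \in \mathcal{U}$, $\mathrm{Cov}_\theta(\delta(X), U(X)) = 0$
for all $\theta$. Now, let $\delta_1(X)$ be an unbiased estimator of $E_\theta\,\delta(X)$. Then there
exists $U \in \mathcal{U}$ such that $\delta_1 = \delta + U$. It follows that

$$\mathrm{Var}_\theta(\delta_1(X)) = \mathrm{Var}_\theta(\delta(X)) + 2\,\mathrm{Cov}_\theta(\delta(X), U(X)) + \mathrm{Var}_\theta(U(X)) = \mathrm{Var}_\theta(\delta(X)) + \mathrm{Var}_\theta(U(X)) \ge \mathrm{Var}_\theta(\delta(X)),$$

hence $\delta(X)$ is UMVUE. □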
Sometimes unbiased estimators exist, but none is UMVUE. 

Example 5.10. Suppose that $P_\theta$ says that $Y_1, Y_2, \ldots$ are IID $Ber(\theta)$. Set

$$X = \begin{cases} 1 & \text{if } Y_1 = 1, \\ \text{\# of trials until 2nd failure} & \text{otherwise,} \end{cases}$$

and suppose that we observe only $X$. Then

$$f_{X|\Theta}(x|\theta) = \begin{cases} \theta & \text{if } x = 1, \\ \theta^{x-2}(1-\theta)^2 & \text{if } x = 2, 3, \ldots. \end{cases}$$

Define the estimator $\delta_0$ to be $\delta_0(x) = 1$ if $x = 1$ and $\delta_0(x) = 0$ if not. Then
$\delta_0$ is an unbiased estimator of $\Theta$. We will now try to find a UMVUE. Assume
$E_\theta U(X) = 0$, for all $\theta$. Then

$$E_\theta U(X) = U(1)\theta + \sum_{x=2}^\infty \theta^{x-2}(1-\theta)^2 U(x) = U(2) + \sum_{k=1}^\infty \theta^k\,[U(k) - 2U(k+1) + U(k+2)] = 0$$

if and only if $U(2) = 0$ and $U(k) = -(k-2)U(1)$ for all $k \ge 3$. This characterizes
all functions in $\mathcal{U}$ according to the value $t = -U(1)$. That is,

$$\mathcal{U} = \{U_t : U_t(x) = (x-2)t, \text{ for all } x\}.$$




Every unbiased estimator of $\Theta$ is $\delta_t(x) = \delta_0(x) + (x-2)t$, for some $t$. In order
for $\delta_t$ to be UMVUE, it must have 0 covariance with every $U_s \in \mathcal{U}$. That is, for
all $s$ and all $\theta$,

$$0 = \sum_{x=1}^\infty f_{X|\Theta}(x|\theta)\,\delta_t(x)\,U_s(x) = \theta(-s)(1-t) + \sum_{x=2}^\infty (1-\theta)^2\theta^{x-2}\,ts\,(x-2)^2.$$

Divide both sides by $(1-\theta)^2$ to get

$$s(1-t)\frac{\theta}{(1-\theta)^2} = ts\sum_{x=2}^\infty (x-2)^2\theta^{x-2}.$$

By rewriting $\theta/[1-\theta]^2$ as an infinite sum, we get

$$\sum_{k=1}^\infty s(1-t)\,k\,\theta^k = \sum_{k=1}^\infty ts\,k^2\theta^k.$$

Since the two series in this last equation are analytic functions of $\theta$, it must be
that $s(1-t)k = tsk^2$ for all $s, k$. This is not possible, hence there is no $t$ such
that these equations hold, and there is no UMVUE.

Oddly enough, there is a locally minimum variance unbiased estimator 
in Example 5.10. 

Definition 5.11. An unbiased estimator $\delta_0(X)$ is locally minimum variance
unbiased, LMVUE, at $\theta_0$ if for every other unbiased estimator $\delta(X)$,
$\mathrm{Var}_{\theta_0}\,\delta_0(X) \le \mathrm{Var}_{\theta_0}\,\delta(X)$.

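Example 5.12 (Continuation of Example 5.10; see page 299). First, note that

$$\mathrm{Var}_\theta\,\delta_t(X) = E_\theta\,\delta_t^2(X) - \theta^2,$$

so a LMVUE at $\theta_0$ can be found by minimizing

$$E_{\theta_0}\delta_t^2(X) = E_{\theta_0}\left([\delta_0(X) + U_t(X)]^2\right) = \theta_0(1-t)^2 + \sum_{x=2}^\infty \theta_0^{x-2}(1-\theta_0)^2(x-2)^2t^2.$$

This expression is quadratic in $t$ and can be minimized by choosing

$$t = \left[1 + \sum_{k=1}^\infty k^2\theta_0^{k-1}(1-\theta_0)^2\right]^{-1},$$

which is different for each $\theta_0$.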

A LMVUE at $\theta_0$ is not the "best" estimator if $\Theta = \theta_0$; it is merely
an unbiased estimator such that the conditional variance given $\Theta = \theta_0$ is
no larger than that of any other unbiased estimator.
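
The dependence of the minimizing $t$ on $\theta_0$ in Example 5.12 can be made concrete with a few lines of code. This is a sketch under stated assumptions: the series is truncated at a hypothetical number of terms, and the closed form $(1 - \theta_0)/2$ used for comparison comes from the standard identity $\sum_{k \ge 1} k^2\theta^k = \theta(1+\theta)/(1-\theta)^3$, which is not stated in the text.

```python
def lmvue_t(theta0, terms=2000):
    """Minimizer t of theta0 * (1 - t)^2 + A * t^2, where
    A = (1 - theta0)^2 * sum_{k >= 1} k^2 * theta0^k (series truncated)."""
    series = sum(k * k * theta0 ** (k - 1) for k in range(1, terms + 1))
    return 1.0 / (1.0 + (1.0 - theta0) ** 2 * series)


# Hypothetical values of theta0, chosen only for illustration.
t_values = {theta0: lmvue_t(theta0) for theta0 in (0.2, 0.5, 0.8)}
for theta0, t in t_values.items():
    print(theta0, t, (1.0 - theta0) / 2.0)
```

Each $\theta_0$ produces a different $t$, which is the numerical face of the statement that no single $\delta_t$ can be UMVUE.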




5.1.2 Lower Bounds on the Variance of Unbiased Estimators 

Suppose that one is interested in unbiased estimators. It would be nice 
to know how low the variances of such estimators can be. Under some 
regularity conditions, there exist lower bounds for the variances of unbiased 
estimators. The Fisher information plays an important role in these lower 
bounds.¹

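Theorem 5.13 (Cramér-Rao lower bound). Suppose that the three FI
regularity conditions hold (see Definition 2.78 on page 111), and let $\mathcal{I}_X(\theta)$
be the Fisher information. Suppose that $\mathcal{I}_X(\theta) > 0$, for all $\theta$. Let $\phi(X)$
be a one-dimensional statistic with $E_\theta|\phi(X)| < \infty$ for all $\theta$. Suppose also
that $\int \phi(x) f_{X|\Theta}(x|\theta)\,d\nu(x)$ can be differentiated under the integral sign with
respect to $\theta$. Then

$$\mathrm{Var}_\theta\,\phi(X) \ge \frac{\left(\frac{d}{d\theta}E_\theta\,\phi(X)\right)^2}{\mathcal{I}_X(\theta)}.$$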

Before proving this theorem, we should look at some examples. 

Example 5.14. Suppose that $X \sim N(\theta, b)$ given $\Theta = \theta$ and $\phi(x) = x$. Then,
the conditions of Theorem 5.13 are satisfied, and we calculated $\mathcal{I}_X(\theta) = 1/b$ in
Example 2.80 on page 111. So

$$E_\theta\,\phi(X) = \theta, \qquad \frac{d}{d\theta}E_\theta\,\phi(X) = 1, \qquad \mathrm{Var}_\theta\,\phi(X) = b.$$

In this case, the Cramér-Rao lower bound is met exactly.

Example 5.15. Suppose that $X \sim U(0, \theta)$ given $\Theta = \theta$. In this case $f_{X|\Theta}(x|\theta) =
\theta^{-1}I_{(0,\theta)}(x)$. We saw in Example 2.81 on page 111 that the conditions of
Theorem 5.13 are not met. Nevertheless, we were able to calculate something that
could have been called Fisher information, namely $\mathcal{I}_X(\theta) = 1/\theta^2$. Let $\phi(x) = x$.
Then $E_\theta\,\phi(X) = \theta/2$ and $\mathrm{Var}_\theta\,\phi(X) = \theta^2/12$. We can calculate

$$\frac{d}{d\theta}E_\theta\,\phi(X) = \frac{1}{2}.$$

If the Cramér-Rao lower bound held here, it would say that $\mathrm{Var}_\theta\,\phi(X) \ge \theta^2/4$,
which is clearly false.

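Proof of Theorem 5.13. Let $B$ and $C$ be the sets described in the FI
regularity conditions. (See Definition 2.78 on page 111.) Let $D = C \cap B^C$, so
that, for all $\theta$, $P_\theta(D) = 1$ and $\int_D f_{X|\Theta}(x|\theta)\,d\nu(x) = 1$. Taking the derivative
with respect to $\theta$ of this integral, we get

$$0 = \int_D \frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)\,d\nu(x) = \int_D \frac{\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\theta)}\,f_{X|\Theta}(x|\theta)\,d\nu(x) = E_\theta\left[\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right].$$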

¹ The first lower bound is due to Rao (1945) and Cramér (1945, Chapter 32;
1946).




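Also, we can differentiate to obtain

$$\frac{d}{d\theta}E_\theta\,\phi(X) = \int \phi(x)\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)\,d\nu(x) = E_\theta\left[\phi(X)\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right] = E_\theta\left\{[\phi(X) - E_\theta\,\phi(X)]\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right\},$$

since the term being added on has zero mean. Now take the absolute value
and use the Cauchy-Schwarz inequality B.19:

$$\left|\frac{d}{d\theta}E_\theta\,\phi(X)\right| \le \sqrt{E_\theta[\phi(X) - E_\theta\,\phi(X)]^2}\;\sqrt{E_\theta\left[\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right]^2} = \sqrt{\mathrm{Var}_\theta\,\phi(X)}\,\sqrt{\mathcal{I}_X(\theta)}.$$

Now square the extreme ends of this string and divide by $\mathcal{I}_X(\theta)$. □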
For an unbiased estimator $\phi(X)$ of $\Theta$, the smallest possible variance
is $1/\mathcal{I}_X(\theta)$, since $dE_\theta\,\phi(X)/d\theta = 1$. A necessary and sufficient condition
for the lower bound to be achieved is that the $\le$ become an $=$ in Theorem 5.13.
The $\le$ in Theorem 5.13 was introduced by the Cauchy-Schwarz
inequality B.19, which provides for equality if and only if the two factors
are linearly related. (That is, if $E(X) = 0$ and $E(Y) = 0$, then
$|E(XY)| = \sqrt{E(X^2)}\sqrt{E(Y^2)}$ if and only if there exist $a$ and $b$ such that
$aX + bY = 0$, a.s. and $ab \ne 0$.) So, the Cramér-Rao lower bound is an
equality if and only if $\phi(X)$ and the score function $\partial\log f_{X|\Theta}(X|\theta)/\partial\theta$ are
linearly related, that is,

$$\frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta) = a(\theta)\phi(x) + d(\theta), \quad \text{a.s. } [P_\theta],$$

for all $\theta$. If we solve this differential equation, we get

$$f_{X|\Theta}(x|\theta) = c(\theta)h(x)\exp\{\pi(\theta)\phi(x)\}.$$

This means that the Cramér-Rao lower bound can be sharp only in a
one-parameter exponential family with $\phi(X)$ being a sufficient statistic.

Example 5.16. Suppose that $X \sim Exp(\lambda)$ given $\Lambda = \lambda$. Set $\theta = 1/\lambda$. Then

$$f_{X|\Theta}(x|\theta) = \frac{1}{\theta}\exp\left\{-\frac{x}{\theta}\right\}I_{(0,\infty)}(x), \qquad \frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta) = -\frac{1}{\theta} + \frac{x}{\theta^2}.$$

Since the score function is a linear function of $x$, it follows that $\phi(x)$ must also be
a linear function of $x$ if $\phi(X)$ is to achieve the lower bound. If $\phi(x) = a + bx$, then
$E_\theta\,\phi(X) = a + b\theta$, so $a = 0$ and $b = 1$ gives an unbiased estimator that achieves
the Cramér-Rao lower bound. The reader should verify that this is indeed the
case.
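
One informal way to carry out the verification suggested in Example 5.16 is by simulation. The sketch below is not a proof: the sample size, seed, and $\theta = 2$ are hypothetical choices, and the Fisher information is estimated as the average squared score rather than computed exactly.

```python
import random


def simulate_exponential_cr(theta, n_draws=200_000, seed=0):
    """Monte Carlo estimates of Var_theta(X) and of 1 / I_X(theta) for X ~ Exp with mean theta."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0 / theta) for _ in range(n_draws)]  # rate 1/theta, mean theta
    mean_x = sum(xs) / n_draws
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n_draws - 1)
    # Score function from the example: -1/theta + x/theta^2.
    info = sum((-1.0 / theta + x / theta ** 2) ** 2 for x in xs) / n_draws
    return var_x, 1.0 / info


theta = 2.0  # hypothetical parameter value
var_x, cr_bound = simulate_exponential_cr(theta)
print(var_x, cr_bound, theta ** 2)
```

Both estimates should land near $\theta^2$, consistent with $\phi(X) = X$ attaining the bound $1/\mathcal{I}_X(\theta) = \theta^2$.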




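Example 5.17. Outside of exponential families, the Cramér-Rao lower bound
cannot be achieved. For example, suppose that

$$f_{X|\Theta}(x|\theta) = \frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\left(1 + \frac{(x-\theta)^2}{a}\right)^{-\frac{a+1}{2}}.$$

This is the family of $t_a(\theta, 1)$ distributions. In order for the variance to exist,
suppose that $a > 3$. Then

$$\frac{\partial^2}{\partial\theta^2}\log f_{X|\Theta}(x|\theta) = -(a+1)\frac{a - (x-\theta)^2}{[a + (x-\theta)^2]^2}.$$

Call this $\psi(x)$. Then $\mathcal{I}_X(\theta) = -E_\theta\,\psi(X)$, that is,

$$\mathcal{I}_X(\theta) = \frac{a+1}{a}\,\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\int\left(1 - \frac{(x-\theta)^2}{a}\right)\left(1 + \frac{(x-\theta)^2}{a}\right)^{-\frac{a+5}{2}}dx.$$

Since the denominator of the integrand looks like part of the $t_{a+4}$ density, we will
perform the following transformation:

$$\frac{z-\theta}{\sqrt{a+4}} = \frac{x-\theta}{\sqrt{a}}.$$

The integral becomes

$$\frac{a+1}{a}\,\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\sqrt{\frac{a}{a+4}}\int\left(1 - \frac{(z-\theta)^2}{a+4}\right)\left(1 + \frac{(z-\theta)^2}{a+4}\right)^{-\frac{a+5}{2}}dz.$$

Except for the constant, the integral is $E(1 - U^2/[a+4])$, where $U \sim t_{a+4}$. The
correct constant to make the integral equal to this expected value is

$$\frac{\Gamma\left(\frac{a+5}{2}\right)}{\Gamma\left(\frac{a+4}{2}\right)\sqrt{(a+4)\pi}},$$

and the expected value is $(a+1)/(a+2)$, so the result is

$$\mathcal{I}_X(\theta) = \frac{a+1}{a+2}\cdot\frac{a+1}{a}\cdot\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\cdot\sqrt{\frac{a}{a+4}}\cdot\frac{\Gamma\left(\frac{a+4}{2}\right)\sqrt{(a+4)\pi}}{\Gamma\left(\frac{a+5}{2}\right)} = \frac{a+1}{a+3}.$$

This means the Cramér-Rao lower bound is $(a+3)/(a+1)$. We know that
$\phi(X) = X$ is UMVUE because $X$ is a complete sufficient statistic, and $\mathrm{Var}_\theta(X) =
a/(a-2)$, which is always larger than the Cramér-Rao lower bound.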

There is another lower bound that applies in more general cases, such as
when the set of possible values for $X$ depends on $\Theta$. This next lower bound
is due to Chapman and Robbins (1951).



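Theorem 5.18 (Chapman-Robbins lower bound). Let

$$m(\theta) = E_\theta\,\phi(X), \qquad \mathrm{supp}(\theta) = \text{closure of } \{x : f_{X|\Theta}(x|\theta) > 0\}.$$

Assume that for each $\theta \in \Omega$, there is $\theta' \ne \theta$ such that $\mathrm{supp}(\theta') \subseteq \mathrm{supp}(\theta)$.
Then,

$$\mathrm{Var}_\theta\,\phi(X) \ge \sup_{\{\theta' :\, \mathrm{supp}(\theta') \subseteq \mathrm{supp}(\theta)\}} \frac{[m(\theta) - m(\theta')]^2}{E_\theta\left[\left(\frac{f_{X|\Theta}(X|\theta')}{f_{X|\Theta}(X|\theta)} - 1\right)^2\right]}.$$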

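Proof. Let $\theta'$ be such that $\mathrm{supp}(\theta') \subseteq \mathrm{supp}(\theta)$. Let

$$U(X) = \frac{f_{X|\Theta}(X|\theta')}{f_{X|\Theta}(X|\theta)} - 1.$$

Then $E_\theta U(X) = 1 - 1 = 0$, and

$$\mathrm{Cov}_\theta(U(X), \phi(X)) = \int_{\mathrm{supp}(\theta)} [\phi(x)f_{X|\Theta}(x|\theta') - \phi(x)f_{X|\Theta}(x|\theta)]\,d\nu(x) = m(\theta') - m(\theta),$$

since $\mathrm{supp}(\theta') \subseteq \mathrm{supp}(\theta)$. By the Cauchy-Schwarz inequality B.19, the
square of the covariance is at most the product of the variances, so

$$[m(\theta') - m(\theta)]^2 \le \mathrm{Var}_\theta\,\phi(X)\,\mathrm{Var}_\theta U(X).$$

□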



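Example 5.19. This is a case in which the Cramér-Rao lower bound does not
apply. Let $P_\theta$ say that $\{X_n\}_{n=1}^\infty$ are IID with density $f_{X_1|\Theta}(x|\theta) = \exp(\theta - x)I_{[\theta,\infty)}(x)$.
Let $X = (X_1, \ldots, X_n)$. Then $\mathrm{supp}(\theta) = [\theta, \infty)^n$ and $\mathrm{supp}(\theta') \subseteq \mathrm{supp}(\theta)$ so long
as $\theta' \ge \theta$. From the proof of Theorem 5.18,

$$U(X) = \exp\{n(\theta' - \theta)\}I_{[\theta',\infty)}(\min X_i) - 1.$$

If $\phi(X)$ is an unbiased estimator of $\Theta$, then $[m(\theta) - m(\theta')]^2 = (\theta - \theta')^2$, and

$$E_\theta U(X)^2 = \exp\{2n(\theta' - \theta)\}P_\theta(\min X_i \ge \theta') - 2\exp\{n(\theta' - \theta)\}P_\theta(\min X_i \ge \theta') + 1,$$

$$P_\theta(\min X_i \ge \theta') = (P_\theta(X_1 \ge \theta'))^n = \left(\int_{\theta'}^\infty \exp(\theta - x)\,dx\right)^n = \exp\{-n(\theta' - \theta)\},$$

$$E_\theta U(X)^2 = \exp\{n(\theta' - \theta)\} - 1.$$

The Chapman-Robbins lower bound is

$$\mathrm{Var}_\theta\,\phi(X) \ge \sup_{\theta' > \theta}\frac{(\theta - \theta')^2}{\exp\{n(\theta' - \theta)\} - 1} \approx \frac{0.6476}{n^2}.$$

A simple unbiased estimator is $\phi(X) = \min X_i - 1/n$, which has variance $1/n^2$.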
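
The supremum in Example 5.19 has no closed form, but it is easy to approximate. The sketch below substitutes $u = n(\theta' - \theta)$, so that the bound equals $n^{-2}\sup_{u>0} u^2/(e^u - 1)$, and then searches a grid. The grid size and upper limit are arbitrary assumptions; a root finder would serve equally well.

```python
import math


def chapman_robbins_constant(grid_points=200_000, u_max=20.0):
    """Grid search for sup over u > 0 of u^2 / (exp(u) - 1), with u = n * (theta' - theta)."""
    best_u, best_val = 0.0, 0.0
    for i in range(1, grid_points + 1):
        u = u_max * i / grid_points
        val = u * u / math.expm1(u)  # expm1(u) = exp(u) - 1, accurate for small u
        if val > best_val:
            best_u, best_val = u, val
    return best_u, best_val


u_star, constant = chapman_robbins_constant()
n = 25  # hypothetical sample size
print(u_star, constant, constant / n ** 2, 1.0 / n ** 2)
```

The last two printed values compare the lower bound with the variance $1/n^2$ of $\min X_i - 1/n$; the bound is of the right order in $n$ but is not attained by this estimator.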




Another way to improve the Cramér-Rao lower bound is to raise it if it
is unattainable. We know that it is attained if $\phi(X)$ is perfectly correlated
with the score function $\partial\log f_{X|\Theta}(X|\theta)/\partial\theta$, that is, if the regression of $\phi(X)$
on the score function has 0 residual. If this is not possible, the residual might
be made smaller by regressing $\phi(X)$ on more than just the score function.

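Lemma 5.20. Let $\phi(X)$ be an unbiased estimator of $g(\Theta)$ and let $\psi_i(x, \theta)$,
$i = 1, \ldots, k$, be functions that are not linearly related. Let

$$\gamma^\top = (\gamma_1, \ldots, \gamma_k), \qquad \gamma_i = \mathrm{Cov}_\theta(\phi(X), \psi_i(X, \theta)),$$
$$C = ((c_{ij})), \qquad c_{ij} = \mathrm{Cov}_\theta(\psi_i(X, \theta), \psi_j(X, \theta)).$$

Then $\mathrm{Var}_\theta\,\phi(X) \ge \gamma^\top C^{-1}\gamma$.

Proof. The covariance matrix of $(\phi(X), \psi_1(X, \theta), \ldots, \psi_k(X, \theta))^\top$ is

$$\begin{pmatrix} \mathrm{Var}_\theta\,\phi(X) & \gamma^\top \\ \gamma & C \end{pmatrix}.$$

The inequality follows from the fact that the covariance matrix is positive
semidefinite and $C$ is nonsingular. □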

Lemma 5.20 has two corollaries, one of which is an improvement on 
the Cramer-Rao lower bound and the other of which is a multiparameter 
version of the Cramer-Rao lower bound. 

The first corollary to Lemma 5.20 is one that attempts to improve the 
Cramer-Rao lower bound by making use of the fact that the inequality 
can fail to be an equality because the score function is not linearly related 
to the estimator. To put this another way, if the residual from the linear 
regression of the estimator on the score function has nonzero variance, then 
it might be possible to get the residual variance down by regression on more 
than just the score function. This is the approach taken in Corollary 5.21. 

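Corollary 5.21 (Bhattacharyya system of lower bounds). Assume
the conditions of Theorem 5.13, assume that $k$ partial derivatives with respect
to $\theta$ can be passed under the integral sign, and assume that $J(\theta)$
(defined below) is nonsingular. Then $\mathrm{Var}_\theta\,\phi(X) \ge \gamma^\top(\theta)J^{-1}(\theta)\gamma(\theta)$, where

$$\gamma_i(\theta) = \frac{d^i}{d\theta^i}E_\theta\,\phi(X), \qquad \psi_i(x, \theta) = \frac{1}{f_{X|\Theta}(x|\theta)}\frac{\partial^i}{\partial\theta^i}f_{X|\Theta}(x|\theta),$$
$$J_{ij}(\theta) = \mathrm{Cov}_\theta(\psi_i(X, \theta), \psi_j(X, \theta)),$$
$$\gamma(\theta) = (\gamma_1(\theta), \ldots, \gamma_k(\theta))^\top, \qquad J(\theta) = ((J_{ij}(\theta))).$$

Proof. All we need to do, in order to apply Lemma 5.20, is note that

$$\frac{d^i}{d\theta^i}E_\theta\,\phi(X) = \mathrm{Cov}_\theta\left(\phi(X), \frac{1}{f_{X|\Theta}(X|\theta)}\frac{\partial^i}{\partial\theta^i}f_{X|\Theta}(X|\theta)\right),$$

which follows for higher derivatives in just the same way that it did for the
first derivative in the proof of Theorem 5.13. □

The Cramér-Rao lower bound is the special case with $k = 1$.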

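Example 5.22. Suppose that $X \sim Exp(\lambda)$ given $\Lambda = \lambda$. Set $\theta = 1/\lambda$. Then

$$f_{X|\Theta}(x|\theta) = \frac{1}{\theta}\exp\left\{-\frac{x}{\theta}\right\}I_{(0,\infty)}(x),$$
$$\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta) = \left(-\frac{1}{\theta} + \frac{x}{\theta^2}\right)f_{X|\Theta}(x|\theta),$$
$$\frac{\partial^2}{\partial\theta^2}f_{X|\Theta}(x|\theta) = \left(\frac{2}{\theta^2} - \frac{4x}{\theta^3} + \frac{x^2}{\theta^4}\right)f_{X|\Theta}(x|\theta).$$

Let $\phi(x) = x^2$. We can easily calculate $E_\theta(\phi(X)) = 2\theta^2$ and $\mathrm{Var}_\theta(\phi(X)) = 20\theta^4$.
Since $\mathcal{I}_X(\theta) = 1/\theta^2$ and $dE_\theta(\phi(X))/d\theta = 4\theta$, the Cramér-Rao lower bound on
the variance of $\phi(X)$ is $16\theta^4$. A little bit of calculation yields

$$J(\theta) = \begin{pmatrix} \frac{1}{\theta^2} & 0 \\ 0 & \frac{4}{\theta^4} \end{pmatrix}, \qquad \gamma(\theta) = \begin{pmatrix} 4\theta \\ 4 \end{pmatrix}.$$

So, $\gamma(\theta)^\top J^{-1}(\theta)\gamma(\theta) = 20\theta^4$, and the Bhattacharyya lower bound is achieved.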
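
The "little bit of calculation" in Example 5.22 can be reproduced with exact arithmetic. The sketch below fixes $\theta = 1$ (an assumption made to keep the polynomials simple), uses the standard moments $E X^k = k!$ of the unit exponential, and represents each function as a list of polynomial coefficients; all helper names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def expect_poly(coeffs):
    """E[sum_k coeffs[k] * X^k] for X ~ Exp with mean 1, using E[X^k] = k!."""
    return sum(Fraction(c) * factorial(k) for k, c in enumerate(coeffs))


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


psi1 = [-1, 1]      # psi_1(x) = x - 1          (first derivative ratio at theta = 1)
psi2 = [2, -4, 1]   # psi_2(x) = x^2 - 4x + 2   (second derivative ratio at theta = 1)
phi = [0, 0, 1]     # phi(x) = x^2

# psi_1 and psi_2 have mean zero, so covariances are plain expectations of products.
J11 = expect_poly(poly_mul(psi1, psi1))
J12 = expect_poly(poly_mul(psi1, psi2))
J22 = expect_poly(poly_mul(psi2, psi2))
g1 = expect_poly(poly_mul(phi, psi1))
g2 = expect_poly(poly_mul(phi, psi2))

det = J11 * J22 - J12 * J12
bhattacharyya = (g1 * g1 * J22 - 2 * g1 * g2 * J12 + g2 * g2 * J11) / det
cramer_rao = g1 * g1 / J11
var_phi = expect_poly(poly_mul(phi, phi)) - expect_poly(phi) ** 2

print(J11, J12, J22, g1, g2, cramer_rao, bhattacharyya, var_phi)
```

At $\theta = 1$ the second-order bound coincides with the variance of $\phi(X)$ while the first-order bound falls short, matching the conclusion of the example.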

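Corollary 5.23 (Multiparameter Cramér-Rao lower bound). Assume
that the FI regularity conditions on page 111 hold. Let $\mathcal{I}_X(\theta) =
((\mathcal{I}_{ij}(\theta)))$ be the Fisher information matrix, and suppose that it is positive
definite. Suppose that $\int \phi(x)f_{X|\Theta}(x|\theta)\,d\nu(x)$ can be twice differentiated
under the integral sign with respect to coordinates of $\theta$ and that

$$E_\theta|\phi(X)| < \infty \quad\text{and}\quad \gamma^\top(\theta) = \left(\ldots, \frac{\partial}{\partial\theta_i}E_\theta\,\phi(X), \ldots\right).$$

Then $\mathrm{Var}_\theta\,\phi(X) \ge \gamma^\top(\theta)\,\mathcal{I}_X^{-1}(\theta)\,\gamma(\theta)$.

Proof. All we need to do, in order to apply Lemma 5.20, is note that

$$\frac{\partial}{\partial\theta_i}E_\theta\,\phi(X) = \mathrm{Cov}_\theta\left(\phi(X), \frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(X|\theta)\right),$$

just as in the proof of Theorem 5.13. □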

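Example 5.24. Suppose that $P_\theta$ says $X \sim N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma)$. This is
the same as Example 2.83 on page 112. There we calculated

$$\mathcal{I}_X(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix}.$$

Now, set $\phi(X) = X^2$. Then $E_\theta\,\phi(X) = \mu^2 + \sigma^2$ and

$$\gamma_1(\theta) = 2\mu, \qquad \gamma_2(\theta) = 2\sigma, \qquad \mathrm{Var}_\theta\,\phi(X) = 2\sigma^4 + 4\mu^2\sigma^2,$$
$$\gamma^\top(\theta)\,\mathcal{I}_X(\theta)^{-1}\gamma(\theta) = 4\mu^2\sigma^2 + 2\sigma^4.$$

The Cramér-Rao lower bound is met exactly.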




5.1.3 Maximum Likelihood Estimation 

If the posterior density of $\Theta$ is very high at some value, say $\theta_0$, and relatively
low everywhere else, it means that we are quite sure that $\Theta$ is near $\theta_0$. If
the prior density of $\Theta$ is fairly flat near $\theta_0$ and is not orders of magnitude
larger at other values of $\theta$ than it is at $\theta_0$, then the posterior density will
differ from the likelihood function only by a constant factor near $\theta_0$.²

Definition 5.25. Let $X$ be a random quantity with conditional density
$f_{X|\Theta}(x|\theta)$ given $\Theta = \theta$. If $X = x$ is observed, then the function $L(\theta) =
f_{X|\Theta}(x|\theta)$ considered as a function of $\theta$ for fixed $x$ is called the likelihood
function. Any random quantity $\hat\Theta$ such that

$$\max_{\theta \in \Omega} f_{X|\Theta}(X|\theta) = f_{X|\Theta}(X|\hat\Theta)$$

is called a maximum likelihood estimator (MLE) of $\Theta$. If $\Theta = (\Theta_1, \Theta_2)$ and
$\hat\Theta = (\hat\Theta_1, \hat\Theta_2)$, then $\hat\Theta_1$ is called an MLE of $\Theta_1$.

The idea of maximizing the likelihood function in order to estimate a 
parameter dates back to Fisher (1922). 

Example 5.26. Suppose that $X_1, \ldots, X_n$ are conditionally independent given
$\Theta = \theta$ with $U(0, \theta)$ distribution. Then,

$$f_{X|\Theta}(x|\theta) = \frac{1}{\theta^n}I_{[0,\theta]}(\max_i x_i)\,I_{[0,\max_i x_i]}(\min_i x_i).$$

As a function of $\theta$, the maximum of this function is at $\theta = \max_i X_i$. Hence the
MLE is $\hat\Theta = \max_i X_i$.

Suppose that we had defined the density of each $X_i$ to be $\theta^{-1}I_{(0,\theta)}(x)$ instead of
using the closed interval. In this case there is no value of $\theta$ at which the maximum
is achieved. At first, it would seem that this could easily be fixed by replacing
max by sup in the definition of MLE. This would also require some continuity
condition on the likelihood function. It turns out to be very inconvenient to do
this. Rather, we should use the closed interval for the density.

The following example shows that the MLE may exist but not be unique. 

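Example 5.27. Suppose that $X_1, \ldots, X_n$ are conditionally independent given
$\Theta = \theta$ with $U(\theta - 1/2, \theta + 1/2)$ distribution. Then,

$$f_{X|\Theta}(x|\theta) = I_{[\theta - \frac12, \infty)}(\min_i x_i)\,I_{(-\infty, \theta + \frac12]}(\max_i x_i).$$

As a function of $\theta$, this is constant for $\max_i X_i - 1/2 \le \theta \le \min_i X_i + 1/2$. Any
random variable $\hat\Theta$ between $\max_i X_i - 1/2$ and $\min_i X_i + 1/2$ is an MLE.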



² This observation has led some people to try to base inference on the likelihood
function alone, rather than the posterior distribution. See Barndorff-Nielsen
(1988) for an in-depth study of likelihood.




Theorem 5.28. Let $g$ be a measurable function from $\Omega$ to some space $G$.
Suppose that there exists another space $U$ and a one-to-one measurable
function $h : \Omega \to G \times U$ such that $h(\theta) = (g(\theta), g^*(\theta))$ for some function
$g^*$. If $\hat\Theta$ is an MLE of $\Theta$, then $g(\hat\Theta)$ is an MLE of $g(\Theta)$.

Proof. Since $h$ is one-to-one, the parameter might just as well be $\Psi =
h(\Theta)$. The likelihood for $\Psi$ is $f_{X|\Psi}(x|\psi) = f_{X|\Theta}(x|h^{-1}(\psi))$. For fixed $x$,
if the maximum of $f_{X|\Theta}(x|\theta)$ occurs at $\theta = \hat\theta$, define $\hat\psi = h(\hat\theta)$. Then
the maximum of $f_{X|\Theta}(x|\theta)$ occurs at $\theta = h^{-1}(\hat\psi)$. Now, suppose that the
maximum of $f_{X|\Psi}(x|\psi)$ occurs at $\psi = \psi_0$. If $f_{X|\Psi}(x|\psi_0) > f_{X|\Psi}(x|\hat\psi)$,
then $\theta = h^{-1}(\psi_0)$ would provide a higher value for $f_{X|\Theta}(x|\theta)$ than $\hat\theta$. This
would be a contradiction. It follows that $\psi = \hat\psi$ provides a maximum for
$f_{X|\Psi}(x|\psi)$. It follows that the MLE of $\Psi$ is $h(\hat\Theta)$ and the MLE of the first
coordinates of $\Psi$, namely $g(\Theta)$, is $g(\hat\Theta)$, the first coordinates of $h(\hat\Theta)$. □

Example 5.29 (Continuation of Example 5.7; see page 298). Suppose that $X$
given $\Theta = \theta$ has $Poi(\theta)$ distribution and $g(\theta) = \exp(-3\theta)$. Since the MLE of
$\Theta$ is $X$, and $g$ is one-to-one, the MLE of $g(\Theta)$ is $\exp(-3X)$. This is far more
reasonable than the UMVUE $(-2)^X$.

If the loss function is $L(\theta, a) = (\theta + \frac13\log a)^2$ and the (improper) prior
distribution has Radon-Nikodym derivative $1/\theta$ with respect to Lebesgue measure,
then the formal Bayes rule is also $\exp(-3X)$.

Example 5.30. Let $X_1, \ldots, X_n$ be IID $N(\mu, \sigma^2)$ given $\Theta = (M, \Sigma) = (\mu, \sigma)$. It
is not difficult to see that the MLE of $\Theta$ is $(\bar{X}, \sqrt{W/n})$, where $\bar{X} = \sum_{i=1}^n X_i/n$
and $W = \sum_{i=1}^n (X_i - \bar{X})^2$. Suppose that we want the MLE of $M^2$. The function
$g(M, \Sigma) = M^2$ is not a one-to-one function of either coordinate, but $g^*(\mu, \sigma) =
(\sigma, \mathrm{sign}(\mu))$ will satisfy the conditions of Theorem 5.28. So $\bar{X}^2$ is the MLE of $M^2$.
The UMVUE of $M^2$ is $\bar{X}^2 - W/[n(n-1)]$, which is negative with positive probability.

In exponential families, there is a simple method for finding MLEs in
most cases. The logarithm of the likelihood function will be $\log L(\theta) =
\log c(\theta) + x^\top\theta$. If the MLE occurs in the interior of the parameter space,
it occurs where the partial derivatives of $\log L(\theta)$ are 0. That is, $x_i =
-\partial\log c(\theta)/\partial\theta_i$. By using the method of Example 2.66, we see that the
MLE is that $\theta$ such that $x = E_\theta X$.

Example 5.31 (Continuation of Example 2.68; see page 106). Let $X_1, \ldots, X_n$
be IID $N(\mu, \sigma^2)$ given $(M, \Sigma) = (\mu, \sigma)$. The natural parameter of this exponential
family is $\Theta_1 = M/\Sigma^2$ and $\Theta_2 = -1/[2\Sigma^2]$. The natural sufficient statistic is
$X = (n\bar{X}, \sum_{i=1}^n X_i^2)$. Now $\log c(\theta) = n\log(-2\theta_2)/2 + n\theta_1^2/[4\theta_2]$. The partial
derivative with respect to $\theta_1$ is $n\theta_1/[2\theta_2]$ and the partial with respect to $\theta_2$
is $n/[2\theta_2] - n\theta_1^2/[4\theta_2^2]$. Setting these equal to the negatives of the two coordinates
of $X$ and solving for $\theta_1$ and $\theta_2$ give $\hat\Theta_1 = n\bar{X}/\sum_{i=1}^n (X_i - \bar{X})^2$ and $\hat\Theta_2 =
-n/[2\sum_{i=1}^n (X_i - \bar{X})^2]$. In terms of the usual parameterization $M = -\Theta_1/[2\Theta_2]$
and $\Sigma^2 = -1/[2\Theta_2]$, we get $\hat{M} = \bar{X}$ and $\hat\Sigma^2 = \sum_{i=1}^n (X_i - \bar{X})^2/n$ by Theorem 5.28.




5.1.4 Bayesian Estimation 

Bayesian estimation tends to be somewhat more decision theoretic than
classical estimation. If $g$ is a function on the parameter space, $\aleph$ is the
closure of $g(\Omega)$, and the loss function $L(\theta, a)$ increases as $a$ moves away
from $g(\theta)$, then one could reasonably be said to be estimating $g(\Theta)$. In
Example 3.8 on page 146, we saw that if $\Theta$ is one-dimensional and if
$L(\theta, a) = (\theta - a)^2$, then the formal Bayes rule is to use the posterior mean
of $\Theta$ (so long as the posterior variance is finite).

Example 5.32 (Continuation of Example 5.7; see page 298). Suppose that $P_\theta$
says that $X$ has $Poi(\theta)$ distribution, and $g(\theta) = \exp(-3\theta)$. If the prior
distribution for $\Theta$ is $\Gamma(a, b)$, then the posterior after learning $X = x$ is $\Gamma(a + x, b + 1)$.
The posterior mean of $\exp(-3\Theta)$ is

$$\left(\frac{b+1}{b+4}\right)^{a+x}.$$
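
It may help to put the three estimators of $\exp(-3\theta)$ from Examples 5.7, 5.29, and 5.32 side by side. The sketch below assumes the rate parameterization of the gamma prior implied by the posterior $\Gamma(a + x, b + 1)$, under which the posterior mean of $\exp(-3\Theta)$ is $[(b+1)/(b+4)]^{a+x}$; the hyperparameters $a = b = 1$ are hypothetical.

```python
import math


def umvue(x):
    """UMVUE from Example 5.7: (-2)^x."""
    return (-2.0) ** x


def mle(x):
    """MLE from Example 5.29: exp(-3x)."""
    return math.exp(-3.0 * x)


def bayes(x, a, b):
    """Posterior mean of exp(-3 * Theta) under a Gamma(a, b) prior (rate b)."""
    return ((b + 1.0) / (b + 4.0)) ** (a + x)


a, b = 1.0, 1.0  # hypothetical prior hyperparameters
table = [(x, umvue(x), mle(x), bayes(x, a, b)) for x in range(5)]
for row in table:
    print(row)
```

Only the UMVUE leaves the interval $(0, 1)$ in which $\exp(-3\theta)$ must lie.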

Another popular loss function is $L(\theta, a) = |\theta - a|$. The formal Bayes rule
in this case is a special case of the following result.

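Theorem 5.33. Suppose that $\Theta$ has finite posterior mean. For the loss

$$L(\theta, a) = \begin{cases} c(a - \theta) & \text{if } a \ge \theta, \\ (1-c)(\theta - a) & \text{if } a < \theta, \end{cases} \qquad (5.34)$$

a formal Bayes rule is any $1 - c$ quantile of the posterior distribution of $\Theta$.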

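Proof. Suppose that $a'$ is chosen to be a $1 - c$ quantile of the posterior
distribution of $\Theta$. Then

$$\Pr(\Theta \le a'|X = x) \ge 1 - c, \qquad \Pr(\Theta \ge a'|X = x) \ge c.$$

If $a > a'$, then

$$L(\theta, a) - L(\theta, a') = \begin{cases} c(a - a') & \text{if } a' \ge \theta, \\ c(a - a') - (\theta - a') & \text{if } a \ge \theta > a', \\ (1-c)(a' - a) & \text{if } \theta > a \end{cases} \;=\; c(a - a') + \begin{cases} 0 & \text{if } a' \ge \theta, \\ a' - \theta & \text{if } a \ge \theta > a', \\ (a' - a) & \text{if } \theta > a. \end{cases}$$

It follows that the difference in the posterior risks is

$$r(a|x) - r(a'|x) = c(a - a') + \int_{(a',a]} (a' - \theta)f_{\Theta|X}(\theta|x)\,d\lambda(\theta) + (a' - a)\Pr(\Theta > a|X = x)$$
$$\ge c(a - a') + (a' - a)\Pr(\Theta > a'|X = x) = (a - a')[c - \Pr(\Theta > a'|X = x)].$$

Since $\Pr(\Theta > a'|X = x) \le c$, it follows that $r(a|x) \ge r(a'|x)$. Similarly, if
$a < a'$, then

$$L(\theta, a) - L(\theta, a') = c(a - a') + \begin{cases} 0 & \text{if } a \ge \theta, \\ \theta - a & \text{if } a' \ge \theta > a, \\ (a' - a) & \text{if } \theta > a'. \end{cases}$$

Because $\theta - a \ge 0$ on $(a, a']$ with $\theta - a = a' - a$ at $\theta = a'$, it follows that
$r(a|x) - r(a'|x) \ge (a' - a)[\Pr(\Theta \ge a'|X = x) - c]$. Since
$\Pr(\Theta \ge a'|X = x) \ge c$, it follows that $r(a|x) \ge r(a'|x)$,
so $a'$ provides the minimum posterior risk. □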
Notice that Theorem 5.33 remains true if the loss in (5.34) is replaced by

$$L(\theta, a) = \begin{cases} c(a - \theta) & \text{if } a > \theta, \\ (1-c)(\theta - a) & \text{if } a \le \theta, \end{cases}$$

even if $\Theta$ has a discrete distribution. The reason is that the loss is 0 when
$\theta = a$ for both loss functions. As a corollary, we can let $c = 1/2$ in (5.34),
and we get that the median is the formal Bayes rule for absolute error loss.
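
Theorem 5.33 can be illustrated on a small discrete posterior. Everything in the sketch below is hypothetical (the support points, the probabilities, and $c = 0.3$). It relies on the fact that the posterior risk is piecewise linear in $a$ with kinks only at support points, so a minimizer can be found among them.

```python
def check_loss(theta, a, c):
    """Loss (5.34): c * (a - theta) if a >= theta, else (1 - c) * (theta - a)."""
    return c * (a - theta) if a >= theta else (1.0 - c) * (theta - a)


def posterior_risk(a, support, probs, c):
    """Posterior expected loss of action a under a discrete posterior."""
    return sum(p * check_loss(th, a, c) for th, p in zip(support, probs))


def is_quantile(a, support, probs, level):
    """True if Pr(Theta <= a) >= level and Pr(Theta >= a) >= 1 - level."""
    at_most = sum(p for th, p in zip(support, probs) if th <= a)
    at_least = sum(p for th, p in zip(support, probs) if th >= a)
    return at_most >= level and at_least >= 1.0 - level


# Hypothetical discrete posterior, chosen only for illustration.
support = [0.0, 1.0, 2.0, 3.0, 4.0]
probs = [0.1, 0.2, 0.3, 0.25, 0.15]
c = 0.3

best_a = min(support, key=lambda a: posterior_risk(a, support, probs, c))
print(best_a, is_quantile(best_a, support, probs, 1.0 - c))
```

With these numbers the risk-minimizing action is the $1 - c = 0.7$ quantile of the posterior. Setting $c = 1/2$ would pick out a median instead.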



5.1.5 Robust Estimation* 

A pragmatic approach to statistical inference will often allow for the pos- 
sibility that probability distributions used in modeling are not to be taken 
too seriously. For example, when we model data as conditionally IID with
$N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$, we might not be saying that this
description is a precise specification of our beliefs, but rather it is an approx-
imation that we hope will be sufficient for most purposes. Occasionally, the 
approximation is not sufficient. For example, if there is some small chance 
that one observation will be generated by a process much different from 
the others, we might wish to use a model that makes this belief explicit. 
Alternatively, we might want to use a procedure for estimating a parameter 
that is not sensitive to the occasional observation that comes from the dif- 
ferent process. This latter is the approach that leads to robust estimation. 
Of course, robust estimation is not concerned solely with occasional aber- 
rant observation, but it is also concerned with general misspecification of 
distributions. The approach to robust estimation outlined here originated 
with Huber (1964). 

Suppose that we will be estimating some functional $T$ of the distribution
$P$ of the data. For example, if $P$ is a distribution with finite mean,
then $T(P) = \int x\,dP(x)$ is the mean expressed as a function of $P$. Similarly,
the median of a one-dimensional distribution $P$ with continuous, strictly
increasing CDF $F$ is $T(P) = F^{-1}(1/2)$. The influence function of a functional
$T$ is a means of assessing the sensitivity of the functional to small
changes in the distribution $P$.



*This section may be skipped without interrupting the flow 




Definition 5.35. Let $\mathcal{P}_0$ be a collection of distributions on a Borel space
$(\mathcal{X}, \mathcal{B})$ and let $T : \mathcal{P}_0 \to \mathbb{R}^k$ be a functional. For each $x \in \mathcal{X}$, $P \in \mathcal{P}_0$,
$t \in [0, 1)$, and $B \in \mathcal{B}$, we define $P_{x,t}(B) = (1 - t)P(B) + tI_B(x)$. The
influence function of $T$ at $P$ is the following function of $x$:

$$IF(x; T, P) = \lim_{t \downarrow 0}\frac{T(P_{x,t}) - T(P)}{t},$$

for those $x$ such that the limit exists.

In particular, the influence function of $T$ at $P$ gives, for each $x$, the rate
of change in $T$ when $P$ is contaminated by an infinitesimal mass at $x$. In
other words, it is the right-hand derivative of $T(P_{x,t})$ with respect to $t$ at
$t = 0$.

Example 5.36. If $T$ is the mean functional, then $T(P_{x,t}) = [1 - t]T(P) + tx$.
It follows that $IF(x; T, P) = x - T(P)$ for all $x$ and $P$ with finite mean $T(P)$.
Clearly, if $P$ is contaminated by some mass at $x = T(P)$, the mean will not
change. Otherwise, the mean changes proportionally to how far $x$ is from $T(P)$.
In a finite sample setting, suppose that we obtain $n$ observations $x_1, \ldots, x_n$ and
we contemplate one further observation $x_{n+1}$. We can think of the empirical
CDF of the first $n$ observations as a probability measure $P$, and then with $t =
1/(n + 1)$, $P_{x_{n+1},t}$ will be the empirical CDF of all $n + 1$ observations. In this
case, the difference between the sample averages $T(P_{x_{n+1},t}) - T(P)$ is exactly
$[x_{n+1} - T(P)]/(n + 1)$.
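
The finite-sample identity for the mean is simple enough to check directly. In the sketch below the data and the new observation are hypothetical; the point is only that the change in the sample average equals $t$ times the influence function with $t = 1/(n + 1)$.

```python
def mean(values):
    """Sample average."""
    return sum(values) / len(values)


sample = [2.0, 3.5, 1.0, 4.5, 4.0]  # hypothetical observations x_1, ..., x_n
new_obs = 50.0                      # hypothetical further observation x_{n+1}
n = len(sample)
t = 1.0 / (n + 1)

change = mean(sample + [new_obs]) - mean(sample)
predicted = t * (new_obs - mean(sample))  # t * IF(x_{n+1}; T, P) for the mean

print(change, predicted)
```

Because the influence function $x - T(P)$ is unbounded in $x$, making `new_obs` larger moves the average without limit.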

Example 5.37. For the median functional, where $F$ is the CDF of $P$,

$$T(P_{x,t}) = \begin{cases} F^{-1}\left(\frac{1-2t}{2[1-t]}\right) & \text{if } x < F^{-1}\left(\frac{1-2t}{2[1-t]}\right), \\ F^{-1}\left(\frac{1}{2[1-t]}\right) & \text{if } x > F^{-1}\left(\frac{1}{2[1-t]}\right), \\ x & \text{otherwise.} \end{cases}$$

If we let $\mathcal{P}_0$ be the class of distributions that have CDF with derivative at the
median, then the influence function exists at all of $\mathbb{R}$ except the median of $P$. For
each $P$ (with derivative of the CDF being $f$) and all $x$ not equal to the median,

$$IF(x; T, P) = \frac{1}{2f(F^{-1}(\frac12))}\begin{cases} 1 & \text{if } x > F^{-1}(\frac12), \\ -1 & \text{if } x < F^{-1}(\frac12). \end{cases}$$

For all $x$ less than the median, the effect of a contamination at $x$ is essentially the
same. The same is true for all $x$ greater than the median. In a finite sample setting,
suppose that we obtain $n$ observations $x_1, \ldots, x_n$ and we contemplate one further
observation $x_{n+1}$. We can think of the empirical CDF of the first $n$ observations
as a probability measure $P$, and then with $t = 1/(n + 1)$, $P_{x_{n+1},t}$ will be the empirical
CDF of all $n + 1$ observations. In this case, the difference between the sample
medians $T(P_{x_{n+1},t}) - T(P)$ has slightly different forms depending on whether $n$ is
odd or even. For the case of $n$ odd, $T(P) = x_{([n+1]/2)}$, the $[n + 1]/2$ order statistic.
If $x_{n+1} > x_{([n+1]/2+1)}$, then $T(P_{x_{n+1},t}) = [x_{([n+1]/2)} + x_{([n+1]/2+1)}]/2$, which is
independent of the value of $x_{n+1}$ (so long as it is larger than $x_{([n+1]/2+1)}$). A similar
expression holds for $x_{n+1} < x_{([n+1]/2-1)}$. So $T(P_{x_{n+1},t}) - T(P)$ is
half the difference between $x_{([n+1]/2)}$ and the next observation either above or
below it. If the distribution is continuous, then this difference is approximately one
over two times the density at the median times $1/n$, that is, $1/[2nf(F^{-1}(1/2))]$, which
is approximately $t\,IF(x_{n+1}; T, P)$.

One way to summarize the influence function is by the gross error sensitivity.
This is defined as $\gamma^*(T, P) = \sup_x |IF(x; T, P)|$. If $\gamma^*(T, P)$ is
infinite, then there is no bound on how much $T$ can change when $P$ is
contaminated by even a small amount of mass at an arbitrary point.

Example 5.38. For the mean functional $T$, $\gamma^*(T, P)$ is the largest absolute deviation
possible from the mean. For distributions with unbounded support, $\gamma^* = \infty$.
For the median, on the other hand, $\gamma^*(T, P) = [2f(F^{-1}(1/2))]^{-1}$, which is finite.
For this reason, we say that the median is more robust with respect to gross
errors than the mean.
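
The contrast between bounded and unbounded gross error sensitivity shows up already in small samples. The sketch below uses a hypothetical sample of odd size and adds one increasingly extreme observation; it only illustrates the behavior and is not an estimate of $\gamma^*$.

```python
import statistics

sample = [2.0, 3.5, 1.0, 4.5, 4.0]  # hypothetical data, n = 5 (odd)
outliers = [10.0, 1e3, 1e6]         # increasingly gross errors

mean_shifts = []
median_shifts = []
for x_new in outliers:
    contaminated = sample + [x_new]
    mean_shifts.append(statistics.mean(contaminated) - statistics.mean(sample))
    median_shifts.append(statistics.median(contaminated) - statistics.median(sample))

print(mean_shifts)
print(median_shifts)
```

The shift in the mean grows with the size of the added observation, while the shift in the median stays at half the gap between the middle order statistic and its upper neighbor.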

There is a way to derive estimators with specified influence functions.
Let $\mathcal{P}_0$ be a class of distributions on $(\mathcal{X}, \mathcal{B})$ and let $T : \mathcal{P}_0 \to \mathbb{R}^k$ be a
$k$-dimensional functional of interest. Let $\psi : \mathcal{X} \times \mathbb{R}^k \to \mathbb{R}^k$ be a vector-valued
function. Assume that $\psi(x, \theta)$ is differentiable with respect to $\theta$ at
$\theta = T(P)$ a.s. $[P]$ for each $P \in \mathcal{P}_0$. Suppose that the mean of $\psi(X, T(P))$
is 0 for all distributions $P \in \mathcal{P}_0$. That is,

$$\int \psi(y, T(P))\,dP(y) = 0, \qquad (5.39)$$

for all $P \in \mathcal{P}_0$. For each $x \in \mathcal{X}$, $P \in \mathcal{P}_0$, $t \in [0, 1)$, and $B \in \mathcal{B}$, we
define $P_{x,t}(B) = (1 - t)P(B) + tI_B(x)$, and we assume that the mean of
$\psi(X, T(P_{x,t}))$ is 0 also. That is,

$$(1 - t)\int \psi(y, T(P_{x,t}))\,dP(y) + t\,\psi(x, T(P_{x,t})) = 0. \qquad (5.40)$$

Subtracting (5.40) from (5.39) gives

$$\int [\psi(y, T(P)) - \psi(y, T(P_{x,t}))]\,dP(y) = t\left[\psi(x, T(P_{x,t})) - \int \psi(y, T(P_{x,t}))\,dP(y)\right]. \qquad (5.41)$$



= £ 



Suppose that we can differentiate $\int \psi(y, T(P_{x,t}))\,dP(y)$ with respect to $t$
by differentiating under the integral sign. Dividing both sides of (5.41) by
$t$ and taking the limit as $t \to 0$ gives the derivative at $t = 0$. Since $\psi$ is
continuous in its second argument, and since $T(P_{x,t})$ is continuous at $t = 0$
for all $x$ at which the influence function exists,

$$\lim_{t \to 0}\int \psi(y, T(P_{x,t}))\,dP(y) = 0, \qquad \lim_{t \to 0}\psi(x, T(P_{x,t})) = \psi(x, T(P)),$$

if the influence function of $T$ exists at $x$. So, the limit as $t \to 0$ of $1/t$
times the right-hand side of (5.41) is $\psi(x, T(P))$. Since $T$ is $k$-dimensional,
the influence function will be a vector with coordinates $IF(x; T, P)_j$ for
$j = 1, \ldots, k$. The limit as $t \to 0$ of $1/t$ times the $i$th coordinate of the
left-hand side of (5.41) is

$$-\sum_{j=1}^k \int \frac{\partial}{\partial\theta_j}\psi_i(y, \theta)\bigg|_{\theta = T(P)} IF(x; T, P)_j\,dP(y). \qquad (5.42)$$

Define the matrix $M = ((m_{ij}))$, where

$$m_{ij} = \int \frac{\partial}{\partial\theta_j}\psi_i(y, \theta)\bigg|_{\theta = T(P)}\,dP(y).$$

If we assume that the matrix $M$ is finite and nonsingular, we can set (5.42)
equal to $\psi_i(x, T(P))$ for each $i$ and collect the resulting equations into a
vector equation $-M[IF(x; T, P)] = \psi(x, T(P))$, so that

$$IF(x; T, P) = -M^{-1}\psi(x, T(P)).$$

For an empirical distribution $P_n$, $T(P_n) = T_n$, where $T_n$ solves the equation

$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, T_n) = 0. \qquad (5.43)$$

Estimators that solve equations like (5.43) are called M-estimators because
they are generalizations of maximum likelihood estimators in the following
sense. If $\rho$ is a function such that $\partial\rho(x, \theta)/\partial\theta_i = \psi_i(x, \theta)$, then to maximize
$\sum_{k=1}^n \rho(X_k, T_n)$ it is necessary (but not sufficient in general) that (5.43)
hold (if the maximum does not occur at a boundary point). One can think
of $\rho(x, \theta)$ as a replacement for $\log f_{X_1|\Theta}(x|\theta)$. In this way, an M-estimator
is a generalization of an MLE.

Example 5.44. Let $X_1, \ldots, X_n$ be conditionally IID with density $f_{X_1|\Theta}(x|\theta)$
given $\Theta = \theta$. Let $\mathcal{P}_0 = \{P_\theta : \theta \in \Omega\}$ and let $T(P) = \theta$ if $P$ is $P_\theta$. Suppose
that $\Omega \subseteq \mathbb{R}^k$. Let $\psi_i(x, \theta) = \partial\log f_{X_1|\Theta}(x|\theta)/\partial\theta_i$, the score function. If $f_{X_1|\Theta}$
is sufficiently smooth, we have that the M-estimator corresponding to $\psi$ is the
MLE. The matrix $-M$ is the Fisher information matrix, so the influence function
is the inverse of the information matrix times the score function.

For example, suppose that $\Theta = (M, \Sigma)$ and $f_{X_1|\Theta}$ is the $N(\mu, \sigma^2)$ density.
(See Example 2.83 on page 112.) The Fisher information matrix is

$$\mathcal{I}_{X_1}(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix}.$$

So, the influence function of the MLE $T$ is

$$IF(x; T, P) = \begin{pmatrix} x - \mu \\ \frac{(x-\mu)^2 - \sigma^2}{2\sigma} \end{pmatrix} \qquad (5.45)$$

if $P$ is the $N(\mu, \sigma^2)$ distribution. In fact, one can verify directly that if $P$ is
any distribution with finite variance and $T'(P)$ is the standard deviation, then
$IF(x; T', P)$ is the second coordinate in (5.45). (See Problem 32 on page 342.)

Example 5.46. For an example that does not meet the smoothness criteria,
consider $X_1,\ldots,X_n$ conditionally IID with $U(0,\theta)$ distribution given $\Theta = \theta$. Let
$\mathcal{P}_0$ be the class of distributions on $(\mathbb{R},\mathcal{B}^1)$ with bounded support, and let $T(P)$
be the supremum of the support. The MLE is the maximum of the sample, which
is the supremum of the support of the empirical CDF, so the MLE is $T$ of the
empirical CDF. The influence function for $T$ is $IF(x;T,P) = 0$ if $x \le T(P)$ and
$\infty$ if $x > T(P)$. (See Problem 33 on page 342.)

One famous M-estimator of a one-dimensional location parameter is
based on the function $\psi(x,\theta) = h(x-\theta)$, where

$$h(t) = \begin{cases} -b & \text{if } t < -b, \\ t & \text{if } -b \le t \le b, \\ b & \text{if } t > b. \end{cases}$$

If $P$ is a continuous distribution, then $\psi(y,\theta)$ is differentiable at $\theta = T(P)$
with probability 1. The influence function will be

$$IF(x;T,P) = \frac{h(x-T(P))}{P([T(P)-b,\ T(P)+b])}.$$

Notice that as $b \to 0$, the influence function approaches the influence function
of the median. The finite sample version is the estimator $T_n$ which
solves $\sum_{i=1}^{n}\psi(X_i,T_n) = 0$. As with all M-estimators, one can rewrite this
equation as $\sum_{i=1}^{n} w_{n,i}(X_i - T_n) = 0$, where $w_{n,i} = \psi(X_i,T_n)/(X_i - T_n)$. It
is not difficult to see that $T_n = \sum_{i=1}^{n} w_{n,i}X_i \big/ \sum_{i=1}^{n} w_{n,i}$ solves the equation.
Since $T_n$ appears on both sides of this equation, a solution must be
found iteratively. For example, make an initial guess $T_n^{(0)}$ and define

$$T_n^{(k)} = \frac{\sum_{i=1}^{n} w_{n,i}^{(k-1)}X_i}{\sum_{i=1}^{n} w_{n,i}^{(k-1)}}, \qquad w_{n,i}^{(k-1)} = \frac{\psi(X_i,T_n^{(k-1)})}{X_i - T_n^{(k-1)}},$$

for $k \ge 1$ until convergence occurs. In the special case we are considering,
the weights $w_{n,i}$ have a nice form: $w_{n,i} = 1$ if $|X_i - T_n| \le b$ and $w_{n,i} =
b/|X_i - T_n|$ otherwise.
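The reweighting iteration just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `huber_psi` and `huber_location`, the use of the sample median as the initial guess, and the stopping tolerance are choices made here, not part of the text.

```python
from statistics import median


def huber_psi(t, b):
    """The function h(t): t clipped to the interval [-b, b]."""
    return max(-b, min(b, t))


def huber_location(xs, b, tol=1e-10, max_iter=1000):
    """Iteratively reweighted mean for the location M-estimator based on h.

    Assumptions of this sketch: the initial guess is the sample median and
    the iteration stops when successive values differ by less than tol.
    """
    t = median(xs)
    for _ in range(max_iter):
        # Weights: 1 if |x - t| <= b, and b / |x - t| otherwise.
        weights = [1.0 if abs(x - t) <= b else b / abs(x - t) for x in xs]
        t_new = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t


if __name__ == "__main__":
    data = [0.1, -0.3, 0.2, 0.4, -0.1, 50.0]  # one gross outlier
    print(huber_location(data, b=1.5))
```

With the illustrative data above, the single outlier at 50 enters only through its clipped contribution $b$, so the estimate stays close to the bulk of the observations, in contrast to the sample mean.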

Other types of robust estimators include trimmed means. The $100\alpha\%$
trimmed mean of a distribution is the conditional mean given that the
observation falls between the $\alpha$ and $1-\alpha$ quantiles of the distribution.
That is, if $F$ is the CDF corresponding to $P$,

$$T(P) = \frac{1}{1-2\alpha}\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\,dP(x)$$

for $\alpha < 1/2$. For $\alpha = 1/2$, the tradition is to call the median the 50%
trimmed mean. The influence function of a trimmed mean at a continuous
distribution is bounded and it has a shape similar to that of the previous
estimator. (See Problem 34 on page 342.)

In Section 7.3.6, we will give some results concerning large sample properties
of M-estimators. More detailed discussion of robust estimators can
be found in the books by Huber (1977, 1981) and Hampel et al. (1986).
In Section 8.6.3, we discuss robustness considerations that are peculiar to
the Bayesian perspective.

5.2 Set Estimation

A set estimator of a function $g$ of a parameter $\Theta$ is a function from the
data space $\mathcal{X}$ to a collection of subsets of the space in which $g(\Theta)$ lies.

5.2.1 Confidence Sets

In Section 4.6, we introduced the P-value as an alternative to testing a
hypothesis at a fixed level. In nice problems, the P-value gave us the set of
all levels at which we could accept the hypothesis. Another alternative is
to fix the level and ask for the set of all hypotheses that we could accept
at that level. This leads to the concept of a confidence set.

Definition 5.47. Let $g : \Omega \to G$ be a function, let $\eta$ be the collection of
all subsets of $G$, and let $R : \mathcal{X} \to \eta$ be a function. The function $R$ is a
coefficient $\gamma$ confidence set for $g(\Theta)$ if for every $\theta \in \Omega$,

• $\{x : g(\theta) \in R(x)\}$ is measurable, and

• $P'_\theta(g(\theta) \in R(X)) \ge \gamma$.

The confidence set $R$ is exact if $P'_\theta(g(\theta) \in R(X)) = \gamma$ for all $\theta \in \Omega$. If
$\inf_{\theta\in\Omega} P'_\theta(g(\theta) \in R(X)) > \gamma$, the confidence set is called conservative. 3

The following result shows how confidence sets relate to nonrandomized
tests in general. Its proof is left to the reader.

3 Some authors require that $P'_\theta(R(X) = \emptyset) = 0$ for all $\theta$ before calling $R$ a
confidence set. Some require that $\inf_{\theta\in\Omega} P'_\theta(g(\theta) \in R(X)) = \gamma$ before saying that
the coefficient is $\gamma$.



Proposition 5.48. Let $g : \Omega \to G$ be a function.

• For each $y \in G$, let $\phi_y$ be a level $\alpha$ nonrandomized test of $H : g(\Theta) =
y$. Let $R(x) = \{y : \phi_y(x) = 0\}$. Then $R$ is a coefficient $1-\alpha$ confidence
set for $g(\Theta)$. The confidence set $R$ is exact if and only if $\phi_y$ is
$\alpha$-similar for all $y$.

• Let $R$ be a coefficient $1-\alpha$ confidence set for $g(\Theta)$. For each $y \in G$,
define

$$\phi_y(x) = \begin{cases} 0 & \text{if } y \in R(x), \\ 1 & \text{otherwise.} \end{cases}$$

Then, for each $y$, $\phi_y$ has level $\alpha$ as a test of $H : g(\Theta) = y$. The test
$\phi_y$ is $\alpha$-similar for all $y$ if and only if $R$ is exact.

Example 5.49. Let $X_1,\ldots,X_n$ be conditionally IID with $N(\mu,\sigma^2)$ distribution
given $(M,\Sigma) = (\mu,\sigma)$. Let $X = (X_1,\ldots,X_n)$. The usual UMPU level $\alpha$ test of
$H : M = y$ is $\phi_y(x) = 1$ if $\sqrt{n}\,|\bar{x}-y|/s > T_{n-1}^{-1}(1-\alpha/2)$, where $T_{n-1}$ is the
CDF of the $t_{n-1}(0,1)$ distribution. This translates into the confidence interval
$[\bar{x} - T_{n-1}^{-1}(1-\alpha/2)\,s/\sqrt{n},\ \bar{x} + T_{n-1}^{-1}(1-\alpha/2)\,s/\sqrt{n}]$.

Example 5.49 is typical of the most popular way to form confidence sets,
namely the use of pivotal quantities. A pivotal is a function $h : \mathcal{X} \times \Omega \to \mathbb{R}$
whose distribution does not depend on the parameter. That is, for all $c$,
$P'_\theta(h(X,\theta) \le c)$ is constant as a function of $\theta$. In Example 5.49, the pivotal
is $\sqrt{n}(\bar{X} - M)/S$, which has $t_{n-1}(0,1)$ distribution given $\Theta = \theta$. The general
method of using a pivotal $h(X,\Theta)$ to form a confidence set is to set $R(x) =
\{\theta : h(x,\theta) \le F_h^{-1}(\gamma)\}$, where $F_h$ is the CDF of $h(X,\Theta)$.

We can define randomized confidence sets if we want a correspondence
to randomized tests.

Definition 5.50. Let $g : \Omega \to G$ be a function, and let $R^* : \mathcal{X} \times G \to [0,1]$
be a function such that

• $R^*(\cdot,y) : \mathcal{X} \to [0,1]$ is measurable for all $y \in G$, and

• $E_\theta[R^*(X,g(\theta))] \ge \gamma$, for all $\theta \in \Omega$.

Then $R^*$ is called a coefficient $\gamma$ randomized confidence set for $g(\Theta)$.

The number $R^*(x,y)$ is to be thought of as the probability that $y$ is included
in the confidence set given that $X = x$ is observed.

Example 5.51. Suppose that $X \sim Bin(2,\theta)$ given $\Theta = \theta$. Let $g(\theta) = \theta$ for all $\theta$.
Define

$$R^*(x,\theta) = \begin{cases}
\max\left\{0,\ 1 - \dfrac{0.05}{(1-\theta)^2}\right\} & \text{if } x = 0, \\
1 & \text{if } x = 1 \text{ and } \theta \le 1 - \sqrt{0.05}, \\
\max\left\{0,\ 1 - \dfrac{0.05-(1-\theta)^2}{2\theta(1-\theta)}\right\} & \text{if } x = 1 \text{ and } \theta > 1 - \sqrt{0.05}, \\
\min\left\{1,\ 1 - \dfrac{\theta^2-0.95}{\theta^2}\right\} & \text{if } x = 2.
\end{cases}$$

It is easy to check that

$$E_\theta[R^*(X,\theta)] = 0.95,$$

for all $\theta$. So, if $X = 1$ is observed, the confidence set consists of the interval
$[0, 1-\sqrt{0.05}]$ together with possibly some $\theta$ values between $1-\sqrt{0.05}$ and $\sqrt{0.95}$
chosen with decreasing probability as $\theta$ increases. For convenience, we could select
a single $U \sim U(0,1)$ independent of $X$ and, if $X = 1$ is observed, include $\theta$ in the
confidence set if

$$U \ge \frac{0.05-(1-\theta)^2}{2\theta(1-\theta)}.$$

For example, if $U = 0.5$ is observed, the confidence set becomes $[0, 0.95]$. This
example is a special case of Proposition 5.52 below.
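The claim that the expected inclusion probability is 0.95 for every $\theta$ can be checked numerically. The sketch below assumes that the inclusion probabilities are one minus the UMP level 0.05 tests of $H : \Theta = \theta$ versus $A : \Theta < \theta$ for a Bin(2, θ) observation; the function names and the grid used for the $U = 0.5$ illustration are hypothetical choices made here.

```python
from math import sqrt


def r_star(x, theta):
    """Inclusion probability of theta when X = x is observed (0 < theta < 1)."""
    if x == 0:
        return max(0.0, 1.0 - 0.05 / (1.0 - theta) ** 2)
    if x == 1:
        if theta <= 1.0 - sqrt(0.05):
            return 1.0
        frac = (0.05 - (1.0 - theta) ** 2) / (2.0 * theta * (1.0 - theta))
        return max(0.0, 1.0 - frac)
    return min(1.0, 1.0 - (theta ** 2 - 0.95) / theta ** 2)


def coverage(theta):
    """E_theta[R*(X, theta)] for X ~ Bin(2, theta)."""
    pmf = [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]
    return sum(p * r_star(x, theta) for x, p in enumerate(pmf))


def largest_included_theta(u, steps=1000):
    """Largest grid value of theta kept when X = 1 and U = u is drawn.

    theta is included when u >= 1 - R*(1, theta); the grid is an assumption.
    """
    kept = [i / steps for i in range(1, steps) if u >= 1.0 - r_star(1, i / steps)]
    return max(kept)


if __name__ == "__main__":
    print([round(coverage(t), 6) for t in (0.1, 0.5, 0.9, 0.99)])
    print(largest_included_theta(0.5))
```

The second printed value is a grid approximation to the upper endpoint of the set obtained with $U = 0.5$.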

Randomized tests correspond to randomized confidence sets in a manner
similar to Proposition 5.48. The randomized confidence set in Example 5.51
was constructed according to Proposition 5.52 using the UMP level 0.05
tests of $H : \Theta = \theta$ versus $A : \Theta < \theta$.

Proposition 5.52. Let $g : \Omega \to G$ be a function.

• For each $y \in G$, let $\phi_y$ be a level $\alpha$ test of $H : g(\Theta) = y$. Let
$R^*(x,y) = 1 - \phi_y(x)$. Then $R^*$ is a coefficient $1-\alpha$ randomized
confidence set for $g(\Theta)$. The randomized confidence set $R^*$ is exact if
and only if $\phi_y$ is $\alpha$-similar for all $y$.

• Let $R^*$ be a coefficient $1-\alpha$ randomized confidence set for $g(\Theta)$. For
each $y \in G$, define $\phi_y(x) = 1 - R^*(x,y)$. Then, for each $y$, $\phi_y$ has
level $\alpha$ as a test of $H : g(\Theta) = y$. The test $\phi_y$ is $\alpha$-similar for all $y$ if
and only if $R^*$ is exact.

The concept of UMP test corresponds to the concept of uniformly most
accurate confidence set.

Definition 5.53. Let $g : \Omega \to G$ be a function, and let $R$ be a coefficient $\gamma$
confidence set for $g(\Theta)$. Let $\eta$ be the collection of all subsets of $G$, and let $B :
G \to \eta$ be a function such that $y \notin B(y)$. Then $R$ is uniformly most accurate
(UMA) coefficient $\gamma$ against $B$ if for each $\theta \in \Omega$ and each $y \in B(g(\theta))$ and
each coefficient $\gamma$ confidence set $T$ for $g(\Theta)$, $P'_\theta(y \in R(X)) \le P'_\theta(y \in T(X))$.

If $R^*$ is a coefficient $\gamma$ randomized confidence set for $g(\Theta)$, then $R^*$ is
UMA coefficient $\gamma$ randomized against $B$ if for every coefficient $\gamma$ randomized
confidence set $T^*$ and every $\theta \in \Omega$ and each $y \in B(g(\theta))$,

$$E_\theta[R^*(X,y)] \le E_\theta[T^*(X,y)].$$

The accuracy of a confidence set against $B$ is its probability of not covering
parameter values in $B(g(\theta))$ given $\Theta = \theta$. The set $B(g(\theta))$ is the set of
values you wish not to have in your confidence set if $\Theta = \theta$. It is not
exactly analogous to the alternative in hypothesis testing, but it is related,
as we will see in Theorem 5.54.






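One can consider a confidence set $R$ as a randomized confidence set by
setting

$$R^*(x,y) = \begin{cases} 1 & \text{if } y \in R(x), \\ 0 & \text{if not.} \end{cases}$$

For this reason, we state the following result in terms of randomized confidence
sets.

Theorem 5.54. Let $g(\theta) = \theta$ for all $\theta$ and let $B : \Omega \to \eta$ be as in Definition 5.53.
Define

$$B^{-1}(\theta) = \{\theta' : \theta \in B(\theta')\}.$$

Suppose that $B^{-1}(\theta)$ is nonempty for every $\theta$. For each $\theta \in \Omega$, let $\phi_\theta$ be
a test. Define $R^*(x,\theta) = 1 - \phi_\theta(x)$. Then $\phi_\theta$ is UMP level $\alpha$ for testing
$H : \Theta = \theta$ versus $A : \Theta \in B^{-1}(\theta)$ for all $\theta$ if and only if $R^*$ is UMA
coefficient $1-\alpha$ randomized against $B$.

Proof. For the "only if" part, suppose that for each $\theta$, $\phi_\theta$ is UMP level
$\alpha$ for testing $H_\theta : \Theta = \theta$ versus $A_\theta : \Theta \in B^{-1}(\theta)$. Let $T^*$ be another
coefficient $1-\alpha$ randomized confidence set. Let $\theta \in \Omega$ and $\theta' \in B(\theta)$. All
that remains is to show that $E_\theta[R^*(X,\theta')] \le E_\theta[T^*(X,\theta')]$. First, note that
$\theta \in B^{-1}(\theta')$. Now, define a test $\psi(x) = 1 - T^*(x,\theta')$. This test $\psi$ has level
$\alpha$ as a test of $H_{\theta'}$, according to Proposition 5.48. Since $\phi_{\theta'}$ is UMP as a
test of $H_{\theta'}$ against the alternative $A_{\theta'} : \Theta \in B^{-1}(\theta')$, and $\theta \in B^{-1}(\theta')$, it
follows that $\beta_\psi(\theta) \le \beta_{\phi_{\theta'}}(\theta)$. We can rewrite this as

$$\beta_\psi(\theta) = E_\theta[\psi(X)] = 1 - E_\theta[T^*(X,\theta')] \le \beta_{\phi_{\theta'}}(\theta) = E_\theta[\phi_{\theta'}(X)] = 1 - E_\theta[R^*(X,\theta')],$$

which establishes the result.

For the "if" part, suppose that $R^*$ is a UMA coefficient $1-\alpha$ randomized
confidence set against $B$. For each $\theta \in \Omega$, let $\psi_\theta$ be a level $\alpha$ test of $H_\theta :
\Theta = \theta$ and define $T^*(x,\theta) = 1 - \psi_\theta(x)$. Then Proposition 5.52 shows that
$T^*$ is a coefficient $1-\alpha$ randomized confidence set. Let

$$\Omega' = \{(\theta,\theta') : \theta' \in \Omega,\ \theta \in B(\theta')\} = \{(\theta,\theta') : \theta \in \Omega,\ \theta' \in B^{-1}(\theta)\},$$

where the second equality follows since $B^{-1}(\theta)$ is nonempty for all $\theta \in \Omega$.
For each $(\theta,\theta') \in \Omega'$, we know that $E_{\theta'}[R^*(X,\theta)] \le E_{\theta'}[T^*(X,\theta)]$. This is
the same as $\beta_{\phi_\theta}(\theta') \ge \beta_{\psi_\theta}(\theta')$ for all $\theta \in \Omega$ and all $\theta' \in B^{-1}(\theta)$. This last
claim means that $\phi_\theta$ is UMP level $\alpha$ for testing $H_\theta$ versus $A_\theta : \Theta \in B^{-1}(\theta)$.
□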

Example 5.55. Suppose that $X_1,\ldots,X_n$ are IID $N(\mu,1)$ given $M = \mu$. Due to
the continuity of this distribution, we will not need to consider randomized tests
and confidence sets. Let

$$R(x) = \left(-\infty,\ \bar{x} + \frac{1}{\sqrt{n}}\Phi^{-1}(1-\alpha)\right].$$

We note that

$$P'_\mu(\mu \in R(X)) = P'_\mu\left(\mu \le \bar{X} + \frac{1}{\sqrt{n}}\Phi^{-1}(1-\alpha)\right) = 1-\alpha,$$

so that $R$ is an exact coefficient $1-\alpha$ confidence set. Consider the test $\phi_\mu(x) = 1$
if $\bar{x} < \mu - \Phi^{-1}(1-\alpha)/\sqrt{n}$. Then $R(x) = \{\mu : \phi_\mu(x) = 0\}$, and $\phi_\mu$ is the UMP level
$\alpha$ test of $H : M = \mu$ versus $A : M < \mu$. So $B^{-1}(\mu) = (-\infty,\mu)$, $B(\mu) = (\mu,\infty)$,
and $R$ is UMA coefficient $1-\alpha$ against $B$. That is, if $\mu < \mu'$, then $R$ has a
smaller chance of covering $\mu'$ than does any other coefficient $1-\alpha$ confidence set,
conditional on $M = \mu$.
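For the normal example above, both the exact coverage and the chance of covering a wrong value $\mu' > \mu$ have closed forms, which the following sketch evaluates with the standard library. The function names are illustrative only; the formula for the chance of covering $\mu'$ follows from $\sqrt{n}(\bar{X} - \mu) \sim N(0,1)$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_limit(xbar, n, alpha):
    """Upper endpoint of the one-sided interval (-inf, xbar + z / sqrt(n)]."""
    return xbar + STD_NORMAL.inv_cdf(1.0 - alpha) / sqrt(n)


def coverage_probability(mu_true, mu_candidate, n, alpha):
    """P(mu_candidate in R(X)) when the data have mean mu_true.

    mu_candidate <= Xbar + z / sqrt(n) is equivalent to
    sqrt(n) * (Xbar - mu_true) >= sqrt(n) * (mu_candidate - mu_true) - z.
    """
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(sqrt(n) * (mu_candidate - mu_true) - z)


if __name__ == "__main__":
    print(coverage_probability(0.0, 0.0, n=10, alpha=0.05))  # true value
    print(coverage_probability(0.0, 0.5, n=10, alpha=0.05))  # a value above mu
```

The first number is the confidence coefficient; the second is the probability of covering a value in $B(\mu)$, which is the quantity that the UMA property keeps as small as possible.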

The following proposition (whose proof is left to the reader) illustrates
why we do not need to introduce a dual concept to UMA confidence sets
corresponding to UMC tests.

Proposition 5.56. For each $\theta \in \Omega$, let $\phi_\theta$ be a floor $\gamma$ test of $H : \Theta \in \Omega_\theta$
versus $A : \Theta = \theta$, where $\theta \notin \Omega_\theta$. Let $R^*(x,\theta) = 1 - \phi_\theta(x)$. If $\phi_\theta$ is UMC
floor $\gamma$ for testing $H : \Theta \in B^{-1}(\theta)$ versus $A : \Theta = \theta$ for all $\theta \in \Omega$, then $R^*$
is UMA coefficient $\gamma$ randomized against $B$.

In other words, UMA confidence sets correspond to both UMP and UMC
tests, just in different ways. 4

The following example is due to Pratt (1961) and it illustrates an inadequacy
in the theory of confidence intervals as described above.

Example 5.57. Suppose that $X_1,\ldots,X_n$ are IID $U(\theta-1/2,\theta+1/2)$ given $\Theta = \theta$.
Minimal sufficient statistics are $T_1 = \min X_i$ and $T_2 = \max X_i$.

$$f_{T_1,T_2|\Theta}(t_1,t_2|\theta) = n(n-1)(t_2-t_1)^{n-2}, \quad \text{for } \theta - \tfrac{1}{2} \le t_1 \le t_2 \le \theta + \tfrac{1}{2}.$$

Suppose that $B(\theta) = (-\infty,\theta)$ and that we want the UMA coefficient $1-\alpha$
confidence set against $B$. If, for each $\theta_0$, we find the UMP level $\alpha$ test of $\Omega_H =
(-\infty,\theta_0]$ versus $\Omega_A = (\theta_0,\infty) = B^{-1}(\theta_0)$, we can use these tests to construct the
UMA coefficient $1-\alpha$ confidence set. This is not an exponential family and it
does not have MLR, since the sufficient statistic is two-dimensional. To find the
UMP level $\alpha$ test (and hence the UMA coefficient $1-\alpha$ confidence set), we use
the Neyman-Pearson lemma. First, let $\theta_1 > \theta_0$. For $k < 1$ and $\theta_1 < \theta_0 + 1$,

$$f_{T_1,T_2|\Theta}(t_1,t_2|\theta_1) > k\,f_{T_1,T_2|\Theta}(t_1,t_2|\theta_0) \tag{5.58}$$

if $t_1 > \theta_1 - 1/2$ or if $t_2 > \theta_0 + 1/2$ (see Figure 5.59). If $k = 1$, then (5.58) changes
to equality on the shaded set in Figure 5.59. If $\theta_1 \ge \theta_0 + 1$, then (5.58) holds for
every $k \ge 0$. To make the test have size $\alpha$ and to make it be
the same for all $\theta_1 > \theta_0$, we must set $\phi = 1$ in the upper corner (shaded region in
Figure 5.59), filling in a large enough area to have probability $\alpha$. This would be

$$\phi(t_1,t_2) = \begin{cases} 1 & \text{if } t_2 > \theta_0 + \frac{1}{2} \text{ or } t_1 > \theta_0 + \frac{1}{2} - \alpha^{1/n}, \\ 0 & \text{if } t_2 \le \theta_0 + \frac{1}{2} \text{ and } t_1 \le \theta_0 + \frac{1}{2} - \alpha^{1/n}. \end{cases}$$

4 Note that the hypothesis and alternative need to be switched in order for the
same confidence set to correspond to both a UMP and a UMC test.

Figure 5.59. Construction of UMP Test in Uniform Example

(To see that this is MP for each $\theta_1 > \theta_0$, note that we choose $k = 1$ in (5.58) if
$\theta_1 - 1/2 \le \theta_0 + 1/2 - \alpha^{1/n}$, and we choose $k = 0$ if $\theta_1 - 1/2 > \theta_0 + 1/2 - \alpha^{1/n}$.)

The UMA coefficient $1-\alpha$ confidence set against $B$ is $[T_*,\infty)$, where $T_* =
\max\{T_1 - 1/2 + \alpha^{1/n},\ T_2 - 1/2\}$. This means that $P'_\theta(\theta' \ge T_*)$ is minimized for all
$\theta' < \theta$ among all coefficient $1-\alpha$ confidence sets. Note, however, that $\Theta \ge T_2 - 1/2$
for sure, and $T_* = T_2 - 1/2$ whenever $T_1 - 1/2 + \alpha^{1/n} \le T_2 - 1/2$, that is, when
$T_2 - T_1 \ge \alpha^{1/n}$. So, we are 100% confident that $\Theta \ge T_*$ whenever $T_2 - T_1 \ge \alpha^{1/n}$,
rather than $100(1-\alpha)\%$ confident. (The probability that $T_2 - T_1 \ge \alpha^{1/n}$ is
$1 - \alpha[n/\alpha^{1/n} - n + 1]$. If $\alpha = 0.05$ and $n = 10$, then $P'_\theta(T_2 - T_1 \ge \alpha^{1/n}) = 0.77$
for all $\theta$. So the understatement of confidence will occur with probability 0.77 no
matter what $\theta$ is.)

But there is more. Suppose that we switch to $B'(\theta) = (\theta,\infty)$, so that the
corresponding hypothesis and alternative are $H' : \Theta \ge \theta_0$ versus $A' : \Theta < \theta_0$. The
UMA coefficient $1-\alpha$ confidence set against $B'$ is $(-\infty,T^*]$, where $T^* = \min\{T_1 +
1/2,\ T_2 + 1/2 - \alpha^{1/n}\}$. (See Problem 31 on page 289.) If $T_2 - T_1 < 2\alpha^{1/n} - 1$, then
the two intervals do not even overlap. For example, if $\alpha = 0.05$ and $n = 10$, the
probability of this occurrence is 0.008. As an example, if $T_1 = 1$ and $T_2 = 1.3$,
then $T_* = 1.24$ and $T^* = 1.06$. So we are 95% confident that $\Theta \le 1.06$ and we
are 95% confident that $\Theta \ge 1.24$. That makes us 190% confident, and we haven't
even covered all the possible values of $\Theta$. The most straightforward way out of
these dilemmas in the classical framework is to condition on the ancillary $T_2 - T_1$.
This will produce more sensible results, which will also be more closely in line
with a Bayesian analysis. The confidence set produced will not be UMA, however.
Another alternative is to use the UMA confidence interval but assert a confidence
coefficient $\alpha(t_2 - t_1)$ that depends on the observed value of the ancillary. 5
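The numbers quoted in this example are easy to reproduce. The sketch below is an illustration only: the function names are hypothetical, and it relies on the standard fact that the range $T_2 - T_1$ of $n$ IID uniform observations on an interval of length 1 has CDF $n r^{n-1} - (n-1) r^n$ for $0 \le r \le 1$.

```python
def uma_bounds(t1, t2, n, alpha):
    """Endpoints of the two one-sided UMA sets [lower, inf) and (-inf, upper]."""
    a = alpha ** (1.0 / n)
    lower = max(t1 - 0.5 + a, t2 - 0.5)
    upper = min(t1 + 0.5, t2 + 0.5 - a)
    return lower, upper


def range_cdf(r, n):
    """CDF of T2 - T1 for n IID uniforms on an interval of length 1."""
    return n * r ** (n - 1) - (n - 1) * r ** n


if __name__ == "__main__":
    n, alpha = 10, 0.05
    a = alpha ** (1.0 / n)
    print(uma_bounds(1.0, 1.3, n, alpha))   # the two endpoints for T1 = 1, T2 = 1.3
    print(1.0 - range_cdf(a, n))            # chance that confidence is understated
    print(range_cdf(2.0 * a - 1.0, n))      # chance that the two sets are disjoint
```

When the first returned endpoint exceeds the second, the two coefficient 0.95 sets do not overlap, which is the situation described for $T_1 = 1$ and $T_2 = 1.3$.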



5 A similar approach was suggested by Barnard (1976) in response to the claim 
by Welch (1939) that conditional confidence intervals are inefficient. Welch was 
pointing out (as we noted above) that conditional intervals, with the same confi- 
dence coefficient for every value of the ancillary, are less efficient (e.g., not UMA, 
or longer on average) than intervals based on the marginal distribution of the 
data. What Barnard showed was that one can fix the marginal confidence coeffi- 
cient to be 1 - a and then choose the conditional confidence coefficient in a way 
to optimize whatever criterion one desires. 






Example 5.57 is a situation in which the distributions satisfy a condition 
called invariance, which will be discussed in Chapter 6. In Section 6.3.2, 
we will prove a theorem that says that in such cases, the posterior proba- 
bility that a parameter lies in a confidence set is equal to the conditional 
confidence coefficient given the ancillary. 

For two-sided or multiparameter confidence sets, we need to extend the 
concept of unbiasedness. 

Definition 5.60. Let $g : \Omega \to G$ be a function, and let $R$ be a coefficient
$\gamma$ confidence set for $g(\Theta)$. Let $\eta$ be the collection of all subsets of $G$, and
let $B : G \to \eta$ be a function such that $y \notin B(y)$. We say that $R$ is unbiased
against $B$ if, for each $\theta \in \Omega$, $P'_\theta(y \in R(X)) \le \gamma$ for all $y \in B(g(\theta))$. We say
$R$ is a uniformly most accurate unbiased (UMAU) coefficient $\gamma$ confidence
set for $g(\Theta)$ against $B$ if it is UMA against $B$ among unbiased coefficient
$\gamma$ confidence sets.

Proposition 5.61. For each $\theta \in \Omega$, let $B(\theta)$ be a subset of $\Omega$ such that
$\theta \notin B(\theta)$, and let $\phi_\theta$ be a level $\alpha$ test of $H : \Theta = \theta$ versus $A : \Theta \in B^{-1}(\theta)$
such that $\phi_\theta$ is nonrandomized. Let $R(x) = \{\theta : \phi_\theta(x) = 0\}$. Then $R$ is a
UMAU coefficient $1-\alpha$ confidence set against $B$ if $\phi_\theta$ is UMPU level $\alpha$
for testing $H$ versus $A$.

The following example shows that the phenomenon of Example 5.57 can 
occur in exponential families with UMA unbiased confidence sets. This 
example is due to Fieller (1954). 

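Example 5.62. Let $\Theta = (B_0,B_1,\Sigma)$, and suppose that $Y_1,\ldots,Y_n$ are conditionally
independent given $\Theta = (\beta_0,\beta_1,\sigma)$ with $Y_i \sim N(\beta_0 + \beta_1 x_i,\sigma^2)$, where
$x_1,\ldots,x_n$ are known numbers. Sufficient statistics are

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad \hat{B}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad W = \sum_{i=1}^{n}\left(Y_i - \bar{Y} - \hat{B}_1(x_i-\bar{x})\right)^2.$$

If we let $S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2$, then the joint density of the sufficient statistics is

$$f_{\bar{Y},\hat{B}_1,W|\Theta}(\bar{y},\hat\beta_1,w|\theta) = \frac{\sqrt{nS_{xx}}\;w^{\frac{n}{2}-2}}{2^{\frac{n}{2}}\,\pi\,\Gamma\left(\frac{n}{2}-1\right)\sigma^n} \exp\left(-\frac{1}{2\sigma^2}\left\{n(\bar{y}-\beta_0-\beta_1\bar{x})^2 + S_{xx}(\hat\beta_1-\beta_1)^2 + w\right\}\right).$$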

Suppose that we want a confidence interval for the value of $x$ that makes
$B_0 + B_1 x = 0$. Let $H = -B_0/B_1$ be that value. We will test $H : H = x_0$ by
testing $(B_0 + B_1 x_0)/\Sigma^2 = 0$. The natural parameters and corresponding sufficient
statistics are

Natural Parameter                                   Sufficient Statistic
$\theta_1 = (\beta_0 + \beta_1\bar{x})/\sigma^2$    $T_1 = n\bar{Y}$
$\theta_2 = \beta_1/\sigma^2$                       $T_2 = S_{xx}\hat{B}_1$
$\theta_3 = -1/(2\sigma^2)$                         $T_3 = W + S_{xx}\hat{B}_1^2 + n\bar{Y}^2$

Now, set $\psi_2 = \theta_2$, $\psi_3 = \theta_3$, and $\psi_1 = (\beta_0 + \beta_1 x_0)/\sigma^2 = \theta_1 + \theta_2(x_0 - \bar{x})$. The
new sufficient statistics are

$$U_1 = n\bar{Y}, \qquad U_2 = S_{xx}\hat{B}_1 - (x_0-\bar{x})n\bar{Y}, \qquad U_3 = T_3.$$
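The inverse of the transformation from $(\bar{Y},\hat{B}_1,W)$ to $(U_1,U_2,U_3)$ is

$$w = u_3 - \frac{[u_2 + (x_0-\bar{x})u_1]^2}{S_{xx}} - \frac{u_1^2}{n}, \qquad \bar{y} = \frac{u_1}{n}, \qquad \hat\beta_1 = \frac{u_2 + (x_0-\bar{x})u_1}{S_{xx}},$$

and the Jacobian is $1/(nS_{xx})$. The joint density of the $U_i$s given $\Psi = \psi$ is

$$\frac{1}{2^{\frac{n}{2}}\,\pi\sqrt{nS_{xx}}\;\Gamma\left(\frac{n}{2}-1\right)} \left(u_3 - \frac{[u_2 + (x_0-\bar{x})u_1]^2}{S_{xx}} - \frac{u_1^2}{n}\right)^{\frac{n}{2}-2} \exp(u_1\psi_1)\,g(u_2,u_3,\psi)\,I_{(0,\infty)}(w),$$

where $g$ is some function that will not concern us. The conditional density of $U_1$
given $\Psi_1 = 0$ and $(U_2,U_3) = (u_2,u_3)$ is

$$\left(u_3 - \frac{[u_2 + (x_0-\bar{x})u_1]^2}{S_{xx}} - \frac{u_1^2}{n}\right)^{\frac{n}{2}-2} I_{(0,\infty)}(w)\,h(u_2,u_3),$$

where $h$ is a function that will not concern us. If we expand out the formula for
$w$ and complete the square in this function as a function of $u_1$, we get

$$r(u_2,u_3)\left[1 - \frac{\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}\right)\left(u_1 + \frac{(x_0-\bar{x})u_2}{\frac{S_{xx}}{n} + (x_0-\bar{x})^2}\right)^2}{u_3 - \frac{u_2^2}{S_{xx} + n(x_0-\bar{x})^2}}\right]^{\frac{n}{2}-2} I_{(0,\infty)}(w).$$

So, the conditional distribution of $U_1$ given $U_2$ and $U_3$ is a location and scale
family and

$$\frac{\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}\left(U_1 + \frac{(x_0-\bar{x})U_2}{\frac{S_{xx}}{n} + (x_0-\bar{x})^2}\right)}{\sqrt{U_3 - \frac{U_2^2}{S_{xx} + n(x_0-\bar{x})^2}}}$$

is independent of $U_2$ and $U_3$. The numerator of this expression is

$$\frac{\hat{B}_0 + \hat{B}_1 x_0}{\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}}, \tag{5.63}$$

where $\hat{B}_0 = \bar{Y} - \hat{B}_1\bar{x}$. The denominator is

$$\sqrt{W + \frac{(\hat{B}_0 + \hat{B}_1 x_0)^2}{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}}. \tag{5.64}$$

The usual $t$ statistic is

$$t = \frac{\hat{B}_0 + \hat{B}_1 x_0}{\sqrt{\frac{W}{n-2}\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}\right)}}.$$

The ratio of (5.63) to (5.64) is $[t/\sqrt{n-2}]\big/\sqrt{1 + t^2/(n-2)}$, whose absolute
value is an increasing function of $|t|$. Hence, the usual $t$-test of $H : B_0 + B_1 x_0 = 0$
is the UMPU level $\alpha$ test of $\Psi_1 = 0$. Hence a UMAU coefficient $1-\alpha$ confidence
set for $H$ is

$$\left\{z : \frac{(\hat{B}_0 + \hat{B}_1 z)^2}{\frac{W}{n-2}\left(\frac{1}{n} + \frac{(z-\bar{x})^2}{S_{xx}}\right)} \le F_{1,n-2}^{-1}(1-\alpha)\right\},$$

where $F_{1,n-2}$ is the CDF of the $F$ distribution with 1 and $n-2$ degrees of freedom.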
We will now find all $z$ that satisfy this inequality. Define

$$v = \frac{\sqrt{n}(z-\bar{x})}{\sqrt{S_{xx}}}, \qquad v^* = -\frac{\sqrt{n}\,\bar{Y}}{\hat{B}_1\sqrt{S_{xx}}}, \qquad F = \frac{(n-2)\hat{B}_1^2 S_{xx}}{W}, \qquad c = F_{1,n-2}^{-1}(1-\alpha).$$

Clearly $v$ is a one-to-one function of $z$ and $v^*$ is its naive estimator. The usual $F$
statistic for testing $B_1 = 0$ is $F$. We can write

$$\hat{B}_0 + \hat{B}_1 z = \bar{Y} + \hat{B}_1(z-\bar{x}) = \sqrt{\frac{S_{xx}}{n}}\,\hat{B}_1(v - v^*),$$

$$\frac{W}{n-2}\left(\frac{1}{n} + \frac{(z-\bar{x})^2}{S_{xx}}\right) = \frac{W}{n-2}\cdot\frac{1+v^2}{n}.$$

So, the confidence set is $\{z : F(v-v^*)^2/(1+v^2) \le c\}$. Now, $F(v-v^*)^2/(1+v^2) \le c$
if and only if

$$v^2(F-c) - 2vv^*F + (Fv^{*2} - c) \le 0. \tag{5.65}$$

We have three cases depending on the sign of $F - c$.

1. $F = c$. Then $B_1$ is just barely significant at level $\alpha$. In this case (5.65) holds
if $v \ge (v^{*2}-1)/(2v^*)$ and $v^* > 0$ or if $v \le (v^{*2}-1)/(2v^*)$ and $v^* < 0$. The
confidence set will be a semi-infinite interval.

2. $F < c$. Then $B_1$ is insignificant at level $\alpha$. In this case the quadratic has a
negative coefficient for $v^2$. The maximum occurs at $v = Fv^*/(F-c)$ and
the maximum value is $c[c - F(v^{*2}+1)]/(F-c)$. If this maximum value is
positive, then the confidence set is the exterior of a bounded open interval.
If the maximum value is negative (equivalently if $F < c/(1+v^{*2})$, which
can happen), the confidence set is $(-\infty,\infty)$, which is absurd.

3. $F > c$. Then $B_1$ is significantly different from 0 at level $\alpha$. The minimum
of the quadratic is always negative and the coefficient of $v^2$ is positive, so
the confidence set will be a closed and bounded interval.

In this example, even though the confidence set satisfies the conditions of being
UMAU, it would not be sensible to use it after observing data.
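The three cases can be made concrete by solving the quadratic inequality $v^2(F-c) - 2vv^*F + (Fv^{*2} - c) \le 0$ directly. The sketch below works on the $v$ scale (values of $z$ are recovered from $z = \bar{x} + v\sqrt{S_{xx}/n}$); the function name and the tuple labels it returns are hypothetical conventions chosen for this illustration, and the boundary case $F = c$ with $v^* = 0$, which the discussion above does not treat, is handled separately.

```python
from math import sqrt


def fieller_set(F, c, v_star):
    """Solve v^2 (F - c) - 2 v v* F + (F v*^2 - c) <= 0 for v.

    Returns one of
      ("interval", lo, hi)  : lo <= v <= hi
      ("exterior", lo, hi)  : v <= lo or v >= hi
      ("at_least", bound)   : v >= bound
      ("at_most", bound)    : v <= bound
      ("all",)              : every real v
    """
    a = F - c
    # One quarter of the discriminant of the quadratic in v.
    quarter_disc = c * (F * (v_star ** 2 + 1.0) - c)
    if a == 0.0:
        if v_star == 0.0:
            return ("all",)  # the inequality reduces to -c <= 0
        bound = (v_star ** 2 - 1.0) / (2.0 * v_star)
        return ("at_least", bound) if v_star > 0.0 else ("at_most", bound)
    if a < 0.0 and quarter_disc <= 0.0:
        return ("all",)  # the maximum of the quadratic is not positive
    root = sqrt(quarter_disc)
    lo, hi = sorted(((v_star * F - root) / a, (v_star * F + root) / a))
    return ("interval", lo, hi) if a > 0.0 else ("exterior", lo, hi)


if __name__ == "__main__":
    for F in (10.0, 4.0, 3.0, 1.0):
        print(F, fieller_set(F, c=4.0, v_star=1.0))
```

With $c = 4$ and $v^* = 1$ held fixed, decreasing $F$ moves the set from a bounded interval through a half-line and the exterior of an interval to the whole real line.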



5.2.2 Prediction Sets* 

One attempt to do predictive inference in the classical setting is the con- 
struction of prediction sets. 

Definition 5.66. Let $V : S \to \mathcal{V}$ be a random quantity. Let $\eta$ be the
collection of all subsets of $\mathcal{V}$, and let $R : \mathcal{X} \to \eta$ be a function. This
function is a coefficient $\gamma$ prediction set for $V$ if

• $\{(x,v) : v \in R(x)\}$ is measurable, and

• $P'_\theta(V \in R(X)) \ge \gamma$, for every $\theta \in \Omega$.

The prediction set $R$ is exact if $P'_\theta(V \in R(X)) = \gamma$ for all $\theta \in \Omega$. If
$\inf_{\theta\in\Omega} P'_\theta(V \in R(X)) > \gamma$, the prediction set is called conservative. 6

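Example 5.67. Suppose that $\{X_n\}_{n=1}^{\infty}$ are IID $N(\mu,\sigma^2)$ given $(M,\Sigma) = (\mu,\sigma)$.
Suppose that we will observe $X = (X_1,\ldots,X_n)$ and that we are interested in
$V = \sum_{i=n+1}^{n+m} X_i/m$. Let $\bar{X}_n = \sum_{i=1}^{n} X_i/n$ and $S_n^2 = \sum_{i=1}^{n}(X_i - \bar{X}_n)^2/(n-1)$.
Since

$$V - \bar{X}_n \sim N\left(0,\ \sigma^2\left[\frac{1}{n} + \frac{1}{m}\right]\right)$$

and is independent of $S_n^2 \sim \Gamma([n-1]/2,\ [n-1]/[2\sigma^2])$, it follows that

$$\frac{V - \bar{X}_n}{S_n\sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t_{n-1}(0,1).$$

Define the set function $R(X)$ to be the interval

$$\left[\bar{X}_n - T_{n-1}^{-1}\left(1-\frac{\alpha}{2}\right)S_n\sqrt{\frac{1}{n}+\frac{1}{m}},\ \ \bar{X}_n + T_{n-1}^{-1}\left(1-\frac{\alpha}{2}\right)S_n\sqrt{\frac{1}{n}+\frac{1}{m}}\right].$$

It is easy to see now that $P'_\theta(V \in R(X)) = 1-\alpha$ for all $\theta \in \Omega$, so $R$ is an exact
coefficient $1-\alpha$ prediction set for $V$.

There are also one-sided prediction sets. For example,

$$R'(X) = \left[\bar{X}_n - T_{n-1}^{-1}(1-\alpha)\,S_n\sqrt{\frac{1}{n}+\frac{1}{m}},\ \infty\right)$$

also satisfies $P'_\theta(V \in R'(X)) = 1-\alpha$ for all $\theta \in \Omega$.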
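As a small illustration, the two-sided prediction interval for the mean of $m$ future observations can be computed as follows. The Python standard library has no $t$ quantile function, so this sketch assumes the caller supplies the quantile $T_{n-1}^{-1}(1-\alpha/2)$ (for example from a table); the function name and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def prediction_interval(xs, m, t_quantile):
    """Two-sided prediction interval for the average of m future observations.

    t_quantile is the 1 - alpha/2 quantile of the t distribution with
    len(xs) - 1 degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    half_width = t_quantile * stdev(xs) * sqrt(1.0 / n + 1.0 / m)
    center = mean(xs)
    return center - half_width, center + half_width


if __name__ == "__main__":
    data = list(range(1, 11))  # illustrative sample with n = 10
    # 2.262 is the tabulated 0.975 quantile of the t distribution with 9 df.
    print(prediction_interval(data, m=1, t_quantile=2.262))
    print(prediction_interval(data, m=100, t_quantile=2.262))
```

As $m$ grows, the factor $\sqrt{1/n + 1/m}$ approaches $\sqrt{1/n}$, so the interval shrinks toward the usual confidence interval for the mean.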



*This section may be skipped without interrupting the flow of ideas. 

6 Some authors require that $P'_\theta(R(X) = \emptyset) = 0$ for all $\theta$ before calling $R$ a
prediction set. Also, some authors require that $\inf_{\theta\in\Omega} P'_\theta(V \in R(X)) = \gamma$ before
calling $\gamma$ the coefficient.






Since tests and confidence sets correspond in a natural way, we might 
expect prediction sets to correspond to predictive tests in a similar way. 

Example 5.68 (Continuation of Example 5.67; see page 324). Suppose that we
set up the predictive hypothesis $H : V \le v_0$ versus $A : V > v_0$. To parallel the
relationship between parametric tests and confidence intervals, we should reject
$H$ if $v_0 \notin R'(X)$. What properties does this predictive test have? If we try to
generalize the type I error probability, we might try to calculate

$$P'_\theta(\text{Reject } H \mid V = v_0) = P'_\theta(\text{Reject } H),$$

where the first equality follows from the fact that $V$ and $X$ are conditionally
independent given $\Theta$. This probability can be calculated using the noncentral $t$
distribution. For $\mu = v_0$, the probability is $1 - T_{n-1}\left(T_{n-1}^{-1}(1-\alpha)\sqrt{1 + n/m}\right) <
\alpha$. The test is easily seen to be a UMPU test (level strictly less than $\alpha$) for
$M \le v_0$, but it is not clear why it should be considered a test of $V \le v_0$.

5.2.3 Tolerance Sets* 

A classical alternative to a prediction set is a tolerance set as introduced 
by Wilks (1941). 7 

Definition 5.69. Let $V : S \to \mathcal{V}$ be a random quantity. Let $\eta$ be the
collection of all subsets of $\mathcal{V}$, and let $R : \mathcal{X} \to \eta$ be a function. This
function is a $\delta$ tolerance set with confidence coefficient $\gamma$ for $V$ if

• $\{(x,v) : v \in R(x)\}$ is measurable, and

• $P'_\theta(P'_\theta[V \in R(X)|X] \ge \delta) \ge \gamma$, for every $\theta \in \Omega$.

The number $\delta$ is called the tolerance coefficient. The tolerance set $R$ is exact
if $P'_\theta(P'_\theta[V \in R(X)|X] \ge \delta) = \gamma$ for all $\theta \in \Omega$. One might wish to require
that $P'_\theta(R(X) = \emptyset) = 0$ for all $\theta$ and/or $\inf_{\theta\in\Omega} P'_\theta(P'_\theta[V \in R(X)|X] \ge \delta) =
\gamma$. If this last condition fails, the tolerance set would be called conservative.

Rather than making a single probability statement concerning the joint 
distribution of the data X and the future observable V (as is done in a 
prediction set), a tolerance set tries to separate the probability statements 
about X and V. A conditional probability statement about V is made 
given X (the tolerance coefficient) and then a statement is made about the 
distribution of this conditional probability (the confidence coefficient). 

Example 5.70 (Continuation of Example 5.67; see page 324). Here the data are
conditionally IID $N(\mu,\sigma^2)$ given $\Theta = (\mu,\sigma)$. Suppose that we want $R(x)$ to have
the same form as the prediction set, namely $R_d(X) = [\bar{X}_n - dS_n,\ \bar{X}_n + dS_n]$,
where $d$ is yet to be determined. Now,

$$P'_\theta(V \in R_d(X)|X) = \Phi\left(\sqrt{m}\,\frac{\bar{X}_n + dS_n - \mu}{\sigma}\right) - \Phi\left(\sqrt{m}\,\frac{\bar{X}_n - dS_n - \mu}{\sigma}\right). \tag{5.71}$$

Call the right-hand side of (5.71) $H_d(X,\mu,\sigma)$. It is easy to see that the distribution
of $H_d(X,\mu,\sigma)$ given $\Theta = (\mu,\sigma)$ is the same as the distribution of $H_d(X,0,1)$ given
$\Theta = (0,1)$. Let $y_{d,\gamma}$ be the $1-\gamma$ quantile of $H_d(X,0,1)$. Since the distributions of
$H_d(X,0,1)$ are stochastically increasing in $d$, it is clear that $y_{d,\gamma}$ is an increasing
continuous function of $d$. Also, $y_{0,\gamma} = 0$ and $\lim_{d\to\infty} y_{d,\gamma} = 1$. Let $d$ be such that
$y_{d,\gamma} = \delta$. With this choice of $d$, we have $P'_\theta(H_d(X,\mu,\sigma) \ge \delta) = \gamma$. It follows that
$R_d$ is a $\delta$ tolerance set with confidence coefficient $\gamma$ for $V$. 8
There are also one-sided tolerance sets. For example,

$$R'_d(X) = [\bar{X}_n - dS_n,\ \infty) \tag{5.72}$$

satisfies $P'_\theta(P'_\theta[V \in R'_d(X)|X] \ge \delta) = \gamma$ for all $\theta \in \Omega$ if $d$ is chosen so that $\delta$ is
the $1-\gamma$ quantile of $1 - \Phi(\sqrt{m}[\bar{X}_n - dS_n])$ given $\Theta = (0,1)$.

*This section may be skipped without interrupting the flow of ideas.
7 For more detail on tolerance sets, see Aitchison and Dunsmore (1975).
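The two layers of probability in a tolerance set can be illustrated numerically. In the sketch below, `tolerance_content` evaluates the conditional probability on the right-hand side of (5.71), and `estimated_confidence` approximates the confidence coefficient attained by a given $d$ by simulating samples from the $N(0,1)$ case. Both function names, the Monte Carlo approach, the seed and the number of replications are assumptions of this sketch; it is not the program cited in the footnote.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def tolerance_content(xbar, s, d, m, mu, sigma):
    """Conditional probability that V lies in [xbar - d*s, xbar + d*s].

    V is the mean of m future observations, so V ~ N(mu, sigma^2 / m).
    """
    v_dist = NormalDist(mu, sigma / sqrt(m))
    return v_dist.cdf(xbar + d * s) - v_dist.cdf(xbar - d * s)


def estimated_confidence(d, delta, n, m, reps=2000, seed=0):
    """Monte Carlo estimate of P(content >= delta) under mu = 0, sigma = 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if tolerance_content(mean(xs), stdev(xs), d, m, 0.0, 1.0) >= delta:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for d in (1.0, 2.0, 3.0):
        print(d, estimated_confidence(d, delta=0.9, n=10, m=1))
```

Because the content is increasing in $d$ for every sample, the estimated confidence is nondecreasing in $d$, so a value of $d$ achieving a target $\gamma$ could be located by bisection on this function.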

One might think that tolerance sets could be used to construct predictive 
tests in the classical framework where prediction sets failed. In the sense 
of (3.15), they can. 

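Example 5.73 (Continuation of Example 5.67; see page 324). Consider the hypothesis
$H : V \le v_0$ and the tolerance set $R'_d$ in (5.72). (Recall that $V$ is the
average of $m$ future observations.) Here $d$ is chosen so that $\delta$ is the $1-\gamma$ quantile
of the distribution of $1 - \Phi(\sqrt{m}[\bar{X}_n - dS_n])$ given $\Theta = (0,1)$. We could reject $H$
if $v_0 \notin R'_d(X)$. We now calculate

$$P'_\theta(\text{Reject } H) = P'_\theta(v_0 < \bar{X}_n - dS_n) = P'_\theta\left(\frac{v_0-\mu}{\sigma} < \frac{\bar{X}_n-\mu}{\sigma} - d\,\frac{S_n}{\sigma}\right)$$

$$= P'_{(0,1)}\left(\frac{v_0-\mu}{\sigma} < \bar{X}_n - dS_n\right)$$

$$= P'_{(0,1)}\left(1 - \Phi\left(\sqrt{m}[\bar{X}_n - dS_n]\right) < 1 - \Phi\left(\sqrt{m}\,\frac{v_0-\mu}{\sigma}\right)\right).$$

This, in turn, is less than or equal to $1-\gamma$ if and only if

$$1 - \Phi\left(\sqrt{m}\,\frac{v_0-\mu}{\sigma}\right) \le \delta. \tag{5.74}$$

Note that the left-hand side of (5.74) is $1 - P'_\theta(V \le v_0)$. So, we have replaced
the hypothesis $H : V \le v_0$ by the hypothesis $H' : P'_\theta(V \le v_0) \ge 1-\delta$. (For
$\delta = c/(1+c)$, $H'$ turns out to be the same as the hypothesis constructed in
Example 4.13 on page 219.)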



8 Eberhardt, Mee, and Reeve (1989) give a program to compute the number d 
in examples like this. 






5.2.4 Bayesian Set Estimation 

In a Bayesian framework, set estimation is a type of inverse to the problem
of computing posterior and/or predictive probabilities. 9 That is, rather
than specifying a set and determining its posterior or predictive probability,
you specify a probability and determine a set that has that probability. The
problem is that there are usually many sets with the same probability. For
example, suppose that the predictive distribution of $V$ given $X = x$ is
continuous with CDF $F_{V|X}(\cdot|x)$. Suppose that you want an interval $T$ such
that $\Pr(V \in T|X = x) = \gamma$. One such interval is $(-\infty, F_{V|X}^{-1}(\gamma|x)]$, and
another is $[F_{V|X}^{-1}(1-\gamma|x), \infty)$, and there are many bounded intervals.

To choose between the many possible sets, it might make sense to have a
loss function and then choose the set with the smallest posterior expected
loss. This approach is discussed in Section 5.2.5. More commonly, one of
the following approaches is taken:

• If $V$ has a density $f_{V|X}(\cdot|x)$, determine a number $t$ such that $T =
\{v : f_{V|X}(v|x) \ge t\}$ satisfies $\Pr(V \in T|X = x) = \gamma$. This choice is
called the highest posterior density region, or HPD region.

• For the case in which $V$ is real-valued and one desires a bounded
interval, choose the endpoints of the interval to be the $(1-\gamma)/2$ and
$(1+\gamma)/2$ quantiles of the distribution of $V$.
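The two approaches in the list above can be sketched for a distribution with a known CDF and density. In the following illustration the equal-tailed interval uses exact quantiles, while the HPD region is approximated on an equally spaced grid by accumulating the highest-density points until their mass reaches $\gamma$ (with density taken with respect to Lebesgue measure). The function names, the grid, and the $N(1, 2^2)$ example are assumptions made here.

```python
from statistics import NormalDist


def equal_tailed_interval(dist, gamma):
    """Interval between the (1 - gamma)/2 and (1 + gamma)/2 quantiles."""
    return dist.inv_cdf((1.0 - gamma) / 2.0), dist.inv_cdf((1.0 + gamma) / 2.0)


def hpd_region_on_grid(grid, density, gamma):
    """Grid approximation to {v : f(v) >= t} with probability about gamma.

    grid must be equally spaced; density[i] is the density at grid[i].
    Returns the selected grid points in increasing order.
    """
    step = grid[1] - grid[0]
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        if mass >= gamma:
            break
        chosen.append(grid[i])
        mass += density[i] * step
    return sorted(chosen)


if __name__ == "__main__":
    post = NormalDist(1.0, 2.0)
    grid = [-9.0 + 0.01 * i for i in range(2001)]
    region = hpd_region_on_grid(grid, [post.pdf(v) for v in grid], 0.95)
    print(equal_tailed_interval(post, 0.95))
    print(region[0], region[-1])
```

For a symmetric unimodal density the two constructions agree up to grid error; for a multimodal density the selected grid points need not form a single interval, so only their smallest and largest values are printed here.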

HPD regions are sensitive to the dominating measure. That is, if the 
conditional distribution of V given X = x is absolutely continuous with 
respect to two different measures, the HPD regions constructed from the 
two different densities might be different. 

Example 5.75. Suppose that $V$ given $X = x$ has $N(x,1)$ distribution. This distribution
is absolutely continuous with respect to Lebesgue measure with density
$(2\pi)^{-1/2}\exp(-[v-x]^2/2)$. The corresponding HPD regions will be symmetric
intervals around $x$ because the density is a decreasing function of $|v-x|$. The
$N(x,1)$ distribution is also absolutely continuous with respect to the $N(0,1)$ distribution.
The density is $\exp(xv - x^2/2)$, which is an increasing function of $v$
for $x > 0$ and is decreasing for $x < 0$. The HPD region would be a semi-infinite
interval in either of these cases. If $x = 0$, then the dominating measure is the
same as the distribution of $V$ and the density is the constant 1. In this case, every
set is an HPD region.

Even if there is no issue of which dominating measure to choose, HPD
regions can be strange if the density of $V$ is multimodal. In particular, they
can be the union of several disconnected subsets. In such cases, one might
prefer just to choose a reasonable shape for the region $T$ and choose the
particular region of that shape so that it is convenient to demonstrate that
$\Pr(V \in T|X = x) = \gamma$.



9 Sets with specified posterior probability are often called credible sets in the 
statistical literature. 






5.2.5 Decision Theoretic Set Estimation* 

Just as we could choose point estimates to minimize a loss function, we 
could also choose set estimates to minimize a loss function. The problem 
is that there are many possible loss functions and one rarely suffers a loss 
according to one of the tractable ones. Nevertheless, we will derive the 
optimal rules for some simple loss functions. 

For a simple situation, suppose that we will form a semi-infinite interval
of the form $(-\infty, a]$ for a one-dimensional parameter $\Theta$ with the loss
function being

$$L(\theta, a) = \begin{cases} c\,(a - \theta) & \text{if } a \ge \theta, \\ (1 - c)(\theta - a) & \text{if } a < \theta. \end{cases} \tag{5.76}$$

If $c < 1 - c$, this loss penalizes overly long intervals that contain the
parameter less (per unit of length) than it penalizes short intervals that miss
the parameter. If the posterior distribution of $\Theta$ has density $f_{\Theta|X}(\theta|x)$ with
respect to a measure $\lambda$, and the posterior mean of $\Theta$ is finite, then
Theorem 5.33 says that the optimal $a$ is any $1 - c$ quantile of the posterior
distribution of $\Theta$.

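For bounded intervals, the action space can be considered to be the set
of ordered pairs $(a_1, a_2)$ in which $a_1 < a_2$. Consider a loss like (5.76) that
penalizes excessive length above, below, and around $\Theta$ differently:

$$L(\theta, [a_1, a_2]) = a_2 - a_1 + \begin{cases} c_1 (a_1 - \theta) & \text{if } \theta < a_1, \\ 0 & \text{if } a_1 \le \theta \le a_2, \\ c_2 (\theta - a_2) & \text{if } a_2 < \theta. \end{cases} \tag{5.77}$$

The optimal interval is the interval between two quantiles of the posterior
distribution.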

Theorem 5.78. Suppose that the posterior mean of $\Theta$ is finite and the
loss is as in (5.77) with $c_1, c_2 > 1$. The formal Bayes rule is the interval
between the $1/c_1$ and $1 - 1/c_2$ quantiles of the posterior distribution of $\Theta$.

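Proof. We can rewrite the loss in (5.77) as $L_1(\theta, a_1) + L_2(\theta, a_2)$, where

$$L_1(\theta, a_1) = \begin{cases} (c_1 - 1)(a_1 - \theta) & \text{if } a_1 \ge \theta, \\ \theta - a_1 & \text{if } a_1 < \theta, \end{cases} \qquad L_2(\theta, a_2) = \begin{cases} a_2 - \theta & \text{if } a_2 \ge \theta, \\ (c_2 - 1)(\theta - a_2) & \text{if } a_2 < \theta. \end{cases}$$

Since each of these loss functions depends on only one action, the posterior
means can be minimized separately. If we divide $L_1$ by $c_1$, then Theorem 5.33
says that the posterior mean of $L_1(\Theta, a_1)/c_1$ is minimized at $a_1$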



*This section may be skipped without interrupting the flow of ideas.








5.3. The Bootstrap 329 



equal to the $1/c_1$ quantile of the posterior. Similarly, if we divide $L_2$ by $c_2$,
Theorem 5.33 says that the posterior mean of $L_2(\Theta, a_2)/c_2$ is minimized at
the $(c_2 - 1)/c_2$ quantile of the posterior. □

In the special case in which $c_1 = c_2 = 2/\alpha > 1$, the optimal interval runs
from the $\alpha/2$ quantile of the posterior to the $1 - \alpha/2$ quantile. This would
be the usual equal-tailed, two-sided posterior probability interval for $\Theta$.
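
As a numerical illustration of Theorem 5.78, the following Python sketch estimates the formal Bayes interval from simulated posterior draws. It assumes that the loss charges the interval length plus $c_1$ times the distance by which the parameter falls below the interval and $c_2$ times the distance by which it falls above. The standard normal "posterior", the function names, and the value $\alpha = 0.10$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def interval_loss(theta, a1, a2, c1, c2):
    """Interval length plus c1 (c2) times the miss distance below (above)."""
    miss_below = np.maximum(a1 - theta, 0.0)   # theta lies below the interval
    miss_above = np.maximum(theta - a2, 0.0)   # theta lies above the interval
    return (a2 - a1) + c1 * miss_below + c2 * miss_above

def formal_bayes_interval(draws, c1, c2):
    """Interval between the 1/c1 and 1 - 1/c2 quantiles of the draws."""
    return np.quantile(draws, 1.0 / c1), np.quantile(draws, 1.0 - 1.0 / c2)

rng = np.random.default_rng(0)
draws = rng.normal(size=200_000)      # hypothetical posterior: N(0, 1)
alpha = 0.10
c1 = c2 = 2.0 / alpha                 # special case c1 = c2 = 2/alpha
a1, a2 = formal_bayes_interval(draws, c1, c2)

# Average loss over the draws approximates the posterior expected loss.
risk_opt = interval_loss(draws, a1, a2, c1, c2).mean()
risk_shifted = interval_loss(draws, a1 + 0.2, a2 + 0.2, c1, c2).mean()
print(a1, a2, risk_opt, risk_shifted)
```

With these choices the endpoints should be close to the 0.05 and 0.95 quantiles of the standard normal distribution, and shifting the interval should increase the estimated posterior expected loss.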

There are other loss functions that do not penalize differently for how
short the interval is when it misses the parameter. For example,

$$L_l(\theta, [a_1, a_2]) = a_2 - a_1 + c\,(1 - I_{[a_1, a_2]}(\theta)),$$
$$L_q(\theta, [a_1, a_2]) = (a_2 - a_1)^2 + c\,(1 - I_{[a_1, a_2]}(\theta)).$$

The existence of optimal rules in these cases actually requires weaker
assumptions than those already considered, because the posterior mean of
the loss is finite in all cases. That is, we need not assume that $\Theta$ has finite
posterior mean to find the formal Bayes rules. If the posterior distribution
of $\Theta$ has a continuous density with respect to Lebesgue measure, then
calculus can be used to minimize the posterior risk. For example, the following
result is easy to prove.

Proposition 5.79. Suppose that $\Omega \subseteq \mathbb{R}$ and that the action space is the
Borel $\sigma$-field $\tau$ of subsets of $\Omega$. Suppose that the posterior distribution of
$\Theta$ has a density $f_{\Theta|X}$ with respect to Lebesgue measure $\lambda$ and that the loss
is $L(\theta, B) = \lambda(B) + c\,(1 - I_B(\theta))$. Then, the formal Bayes rule is an HPD
region of the form $B(x) = \{\theta : f_{\Theta|X}(\theta|x) > 1/c\}$.

For the case in which the density $f_{\Theta|X}$ is strongly unimodal (that is,
$\{\theta : f_{\Theta|X}(\theta|x) > a\}$ is an interval for all $a$), the formal Bayes rule for loss
$L_l$ is an HPD region. In the strongly unimodal case, one can also find the
formal Bayes rule for loss $L_q$. (See Problem 40 on page 343.)
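
Proposition 5.79 can be checked numerically on a grid. The sketch below uses a hypothetical bimodal posterior density (a mixture of two normal densities chosen only for illustration), evaluates the posterior expected loss $\lambda(B) + c\,(1 - \Pr(\Theta \in B \mid X = x))$ by Riemann sums, and compares the region cut at height $1/c$ with regions cut at other heights. The grid, the mixture, and the value $c = 8$ are assumptions.

```python
import numpy as np

def normal_pdf(t, mean, sd):
    return np.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

theta = np.linspace(-6.0, 10.0, 16001)           # grid over the parameter
d_theta = theta[1] - theta[0]
# Hypothetical bimodal posterior density.
density = 0.6 * normal_pdf(theta, 0.0, 1.0) + 0.4 * normal_pdf(theta, 4.0, 0.7)

def posterior_expected_loss(region, c):
    """Riemann-sum version of lambda(B) + c * (1 - Pr(Theta in B | x))."""
    length = region.sum() * d_theta
    coverage = (density * region).sum() * d_theta
    return length + c * (1.0 - coverage)

c = 8.0
hpd = density > 1.0 / c                          # region {theta : f > 1/c}
risk_hpd = posterior_expected_loss(hpd, c)

# Regions cut at other heights should not do better.
other_risks = [posterior_expected_loss(density > h, c) for h in (0.02, 0.05, 0.2, 0.3)]

# Number of disconnected pieces of the HPD region (rising edges of the mask).
pieces = int(np.sum(np.diff(hpd.astype(int)) == 1))
print(risk_hpd, other_risks, pieces)
```

Because this density is bimodal, the region found here is a union of two disjoint intervals, which illustrates the earlier remark that HPD regions need not be connected.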

5.3 The Bootstrap* 
5.3.1 The General Concept 

There are many situations in which it is very difficult to work out
analytically some feature of the distribution of some statistic in which we are
interested. The idea of the bootstrap¹⁰ is to suppose that a CDF $F_n$ calculated
from an observed sample $X_1, \ldots, X_n$ is sufficiently like the unknown
CDF $F$ so that one can use a calculation performed using $F_n$ as an estimate
of the calculation that we would like to perform using $F$. Two types of $F_n$
are commonly used. For the nonparametric bootstrap, $F_n$ is the empirical



*This section may be skipped without interrupting the flow of ideas.
10 For a good overview of bootstrap methodology, see Young (1994).






CDF of the data. For the parametric bootstrap, one assumes a parametric
model (with each $X_i$ having CDF $F_{X_1|\Theta}(\cdot|\theta)$) and $F_n$ is $F_{X_1|\Theta}(\cdot|\hat{\Theta}_n)$ for
some estimate $\hat{\Theta}_n$ of $\Theta$. To be more precise, we follow Efron (1979, 1982).
Let $X = (X_1, \ldots, X_n)$ and let $\mathcal{F}$ be a space of CDFs in which we suppose
that $F$ lies. Let $R : \mathcal{X} \times \mathcal{F} \to \mathbb{R}$ be some function of interest. For example,
$R$ might be the difference between the sample median of $X_1, \ldots, X_n$ and
the median of $F$:

$$R(X, F) = \tfrac{1}{2}\left[X_{(\lfloor (n+1)/2 \rfloor)} + X_{(\lceil (n+1)/2 \rceil)}\right] - F^{-1}(\tfrac{1}{2}),$$

where $F^{-1}(q)$ is understood to mean $\inf\{x : F(x) \ge q\}$, and $X_{(k)}$ is the $k$th
order statistic. The bootstrap replaces $R(X, F)$ by $R(X^*, F_n)$, where $X^*$
is an IID sample of size $n$ from $F_n$. If we are interested in the conditional
mean of $R(X, F)$ given $\mathbf{P} = P$ where $P$ has CDF $F$, we try to calculate
the mean of $R(X^*, F_n)$. The success or failure of the bootstrap will depend
upon the extent to which $F_n$ is "like" $F$ for the purposes of calculating the
distribution of $R$.

Example 5.80. Suppose that $X_1, \ldots, X_n$ are conditionally IID $U(0, \theta)$ given
$\Theta = \theta$. Here, $F$ is the $U(0, \theta)$ CDF. First, take $R(X, F) = \sum_{i=1}^n X_i/n - \int x\,dF(x)$.
The mean of this quantity is 0. In fact, the mean of $R(X, F)$ is 0 if $F$ is any
distribution with finite mean and $X$ is an IID sample from $F$. In particular, the
mean of $R(X^*, F_n)$ given $F_n$ is 0. The parametric and nonparametric bootstraps
do just fine here.

Next, take $R(X, F) = n(F^{-1}(1) - X_{(n)})/F^{-1}(1)$, where $X_{(n)}$ is the largest
coordinate of $X$. The distribution of $X_{(n)}/F^{-1}(1)$ is $Beta(n, 1)$ with CDF $t^n$
for $0 < t < 1$. So, the CDF of $R(X, F)$ is $1 - (1 - t/n)^n$. For large $n$, this
is approximately $1 - \exp(-t)$. For example, if $t = 0.1$, then $P_\theta(R(X, F) > 0.1) \approx \exp(-0.1) = 0.905$.
On the other hand, for the nonparametric bootstrap,
$R(X^*, F_n) = n(X_{(n)} - X^*_{(n)})/X_{(n)}$ and

$$P(R(X^*, F_n) = 0 \mid F_n) = 1 - \left(1 - \tfrac{1}{n}\right)^n,$$

which is approximately $1 - \exp(-1) = 0.632$ for large $n$. The nonparametric
bootstrap will perform poorly here.¹¹
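
The contrast in the second half of Example 5.80 is easy to reproduce by simulation. The following Python sketch compares the sampling distribution of $n(\theta - X_{(n)})/\theta$ for uniform data with the nonparametric bootstrap distribution of $n(X_{(n)} - X^*_{(n)})/X_{(n)}$. The endpoint $\theta = 3$, the sample size, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 20_000
theta = 3.0                                    # hypothetical true endpoint

# Sampling distribution of R(X, F) = n (theta - X_(n)) / theta under U(0, theta).
x_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
r_true = n * (theta - x_max) / theta
p_true_zero = np.mean(r_true == 0.0)           # a continuous R is essentially never 0
p_true_gt = np.mean(r_true > 0.1)              # compare with (1 - 0.1/n)^n

# Nonparametric bootstrap based on one observed sample.
x = rng.uniform(0.0, theta, size=n)
idx = rng.integers(0, n, size=(reps, n))       # resampling indices, with replacement
r_boot = n * (x.max() - x[idx].max(axis=1)) / x.max()
p_boot_zero = np.mean(r_boot == 0.0)           # compare with 1 - (1 - 1/n)^n

print(p_true_zero, p_true_gt, p_boot_zero)
```

The bootstrap version puts probability near 0.63 on the single value 0, whereas the quantity it is meant to imitate has a continuous distribution.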

Why is the bootstrap good for the first half of Example 5.80 but not
for the second half? In the first half, we are only interested in the mean
of $R(X, F)$, which is 0 no matter what $F$ is. In the second half, virtually
everything about the distribution of $R(X, F)$ depends very much on $F$.
Even the mean of $R(X, F)$ is not the same for all $F$. For example, if $F$
has a density that drops to 0 at $F^{-1}(1)$ like a power of $F^{-1}(1) - x$, then



11 The second part of this example was given by Bickel and Freedman (1981).
See Problem 41 on page 343 to see how the parametric bootstrap performs in
this example.






$R(X, F)$ goes to $\infty$ with $n$. (See Theorem 7.32.) At the other extreme, if $F$
has positive mass at $F^{-1}(1)$ (as does the empirical CDF), then $R(X, F)$ has
positive probability of equaling 0.¹² The success or failure of the bootstrap
in individual problems depends on the degree to which $F_n$ approximates $F$
for the specific purpose of calculating $R(X, F)$. For some $R$s, $F_n$ may be
a wonderful approximation while for other $R$s (even with the same data)
$F_n$ is a miserable approximation. One needs to be careful, when using the
bootstrap, not to assume automatically that it will be suitable for one's
specific purpose without doing some checking first.

In the nonparametric bootstrap, the distribution of $R(X^*, F_n)$ will usually
be a combinatorial nightmare, and Efron (1979) suggests using simulations
to approximate features of its distribution. For example, if one is
interested in the probability that $|R(X, F)| < 2$, one can generate many
samples from $F_n$, $X^{*,1}, \ldots, X^{*,m}$, and calculate the proportion of times that
$|R(X^{*,j}, F_n)| < 2$. Similarly, if one is interested in the mean of $R(X, F)$,
one can generate many $X^{*,j}$ and calculate the average of $R(X^{*,j}, F_n)$. For
the parametric bootstrap, the distribution of $R(X^*, F_n)$ is generally of the
same form as the distribution of $R(X, F)$. Once again, simulation may be
useful when this distribution is intractable.¹³
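
The simulation just described can be sketched in a few lines of Python. Here $R$ is taken to be the difference between the median of the resample and the median of $F_n$, with an odd sample size so that the median of $F_n$ is a single order statistic. The function names, the standard normal data, and the number $m$ of resamples are assumptions made for illustration.

```python
import numpy as np

def bootstrap_r_values(x, r_func, m, rng):
    """Return R(X*^j, F_n) for j = 1, ..., m, resampling x with replacement."""
    n = len(x)
    values = np.empty(m)
    for j in range(m):
        x_star = x[rng.integers(0, n, size=n)]   # IID sample of size n from F_n
        values[j] = r_func(x_star, x)            # the data x stand in for F_n
    return values

def median_difference(x_star, x):
    """R(X*, F_n): median of the resample minus the median of F_n (odd n)."""
    return np.median(x_star) - np.median(x)

rng = np.random.default_rng(2)
x = rng.standard_normal(51)                      # hypothetical observed data
r = bootstrap_r_values(x, median_difference, m=5000, rng=rng)

prob_within_2 = np.mean(np.abs(r) < 2.0)         # approximates Pr(|R(X*, F_n)| < 2)
mean_r = r.mean()                                # approximates the mean of R(X*, F_n)
print(prob_within_2, mean_r)
```

Any other function $R$ of the resample and the data can be passed in place of `median_difference`; only the resampling loop is specific to the nonparametric bootstrap.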

Example 5.81. Suppose that we are interested in the mean of
$R(X, F) = n(Y_{1/2} - F^{-1}(1/2))^2$, where $Y_{1/2}$ is the median of a sample of size $n$ from a
distribution with CDF $F$. In Section 7.2 we will work out the theory for the
asymptotic distributions of sample quantiles conditional on the CDF $\mathbf{P} = P$
when $P$ has CDF $F$. But what if one does not know $F$? According to the
assumptions of the bootstrap procedure, if $F_n$ is sufficiently like $F$, we could sample
$n$ observations $X^*$ from the distribution $F_n$ and calculate the sample median $Y^*_{1/2}$.
We could then subtract the original observed sample median (equal to $F_n^{-1}(1/2)$)
from $Y^*_{1/2}$, square the result, and multiply by $n$. We could then repeat this many
times and calculate the average of the squared values. In the case of the
nonparametric bootstrap, if $n$ is only moderate in size (a few hundred or less), exact
calculation of the bootstrap distribution of $R(X^*, F_n)$ is possible using simple
combinatorial arguments. In particular, if $X_{(k)}$ denotes the $k$th order statistic of
the original sample, then for odd $n$,

$$\Pr(Y^*_{1/2} = X_{(k)}) = \frac{1}{n^n} \sum_{\ell=0}^{(n-1)/2} \; \sum_{q=(n+1)/2-\ell}^{n-\ell} \binom{n}{\ell,\ q,\ n-\ell-q} (k-1)^{\ell} (n-k)^{n-\ell-q}, \tag{5.82}$$



12 Bickel and Freedman (1981) claim that even if one were to smooth the boot- 
strap by sampling from a continuous approximation to the empirical CDF, there 
would still be a problem in the second half of Example 5.80. Singh (1981) and 
Bickel and Freedman (1981) prove some large sample properties of the nonpara- 
metric bootstrap as it pertains to estimating central (not extreme) quantiles of 
a distribution. 

13 See Section B.7 for some ideas on how to simulate in general. 






Table 5.83. Summary of Bootstrap Results for Median

Data Type   Sample Size   ER(X, F)   Bootstrap Average   RMS Error
Laplace          51         1.242          1.578            0.949
Laplace         101         1.162          1.423            0.637
Normal           51         1.551          1.898            1.045
Normal          101         1.561          1.702            0.924
Uniform          51         0.241          0.272            0.156
Uniform         101         0.245          0.272            0.115



for $k = 2, \ldots, n - 1$. For $k = 1, n$, we have

$$\Pr(Y^*_{1/2} = X_{(k)}) = \frac{1}{n^n} \sum_{q=(n+1)/2}^{n} \binom{n}{q} (n-1)^{n-q}.$$

We simulated 100 data sets of size $n = 51$ and another 100 data sets of size $n = 101$
from the $Lap(0, 1)$ distribution. Then we repeated the exercise with data from
$N(0, 1)$ and $U(0, 1)$ distributions. We approximated the true mean of $R(X, F)$ for
each of the cases using a simulation of 100,000 data sets (except for the uniform
case in which the true value is just $n$ times the variance of the
$Beta([n+1]/2, [n+1]/2)$ distribution). The results are summarized in Table 5.83. The last column of
Table 5.83 gives the square root of the average squared difference between the true
value of $ER(X, F)$ (third column) and the 100 simulated averages of $R(X^*, F_n)$.
The fourth column gives the average of those 100 simulated averages. It appears
that the bootstrap estimate (fourth column) is slightly high in all cases, but less
so for larger sample sizes. The root mean squared error (last column) is very large
compared to the true value, indicating that the bootstrap estimate of $ER(X, F)$
based on a single data set may not be particularly useful.
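
The exact bootstrap distribution of the median can be computed by direct enumeration. The Python sketch below assumes an odd sample size and distinct observations. It uses the fact that the resampled median equals $X_{(k)}$ exactly when at most $(n-1)/2$ draws fall strictly below $X_{(k)}$ and at least $(n+1)/2$ draws fall at or below it, and it sums the corresponding multinomial probabilities. As a check, the same probabilities are obtained as differences of binomial tail probabilities. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def median_pmf_double_sum(n, k):
    """Pr(bootstrap median = X_(k)) for odd n, as a double multinomial sum."""
    half = (n - 1) // 2
    total = 0
    for l in range(half + 1):                      # draws strictly below X_(k)
        for q in range(half + 1 - l, n - l + 1):   # draws equal to X_(k)
            ways = comb(n, l) * comb(n - l, q)     # multinomial coefficient
            total += ways * (k - 1) ** l * (n - k) ** (n - l - q)
    return Fraction(total, n ** n)

def median_cdf_binomial(n, k):
    """Pr(bootstrap median <= X_(k)): at least (n+1)/2 draws are <= X_(k)."""
    half = (n - 1) // 2
    total = sum(comb(n, j) * k ** j * (n - k) ** (n - j)
                for j in range(half + 1, n + 1))
    return Fraction(total, n ** n)

n = 11
pmf = [median_pmf_double_sum(n, k) for k in range(1, n + 1)]
pmf_check = [median_cdf_binomial(n, k) - median_cdf_binomial(n, k - 1)
             for k in range(1, n + 1)]
print([float(p) for p in pmf])
```

Exact rational arithmetic is used so that the two computations can be compared without rounding error; for sample sizes in the hundreds, floating-point binomial tail probabilities would be the more practical route.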

A Bayesian, faced with a difficult analytical problem, might also wish 
to resort to some form of computational procedure to replace the analysis. 
Rubin (1981) introduced a Bayesian bootstrap. This can be described as follows.
First, simulate a CDF $F$ with Dirichlet process distribution $Dir(F_n)$
(see Section 1.6.1). Second, simulate an IID sample from the CDF $F$ and
compute the observed value of whatever function is of interest. Repeat this 
pair of simulations as many times as desired to obtain a sample of values 
for the function of interest. This procedure [as Rubin (1981) noted] suffers 
from a flaw it has in common with the nonparametric bootstrap. The flaw 
is that the only data values ever simulated are the same ones that were 
originally observed. One never simulates from a distribution with support 
larger than the observed sample. Unless the sample is incredibly large, this 
can make quite unrealistic the assumption that the CDF from which the 
bootstrap samples are drawn is like the one from which the original data 
were generated. Who would ever argue that the observed values were the 
only ones that could have been observed, unless the distribution has known 
finite support? It might make sense instead to use one of the tailfree process 
priors from Section 1.6.2 concentrated on continuous distributions. 
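
The two simulation steps of the Bayesian bootstrap can be sketched in Python as follows. The sketch assumes that the random CDF is generated by attaching $Dir_n(1, \ldots, 1)$ weights to the observed values, which is the construction used for the Bayesian bootstrap in Example 5.84. The function name, the data, and the choice of the median as the function of interest are illustrative assumptions.

```python
import numpy as np

def bayesian_bootstrap(x, statistic, m, rng):
    """Simulate m values of `statistic` using the two-step Bayesian bootstrap."""
    n = len(x)
    values = np.empty(m)
    for j in range(m):
        t = rng.dirichlet(np.ones(n))           # random CDF F puts mass t[i] on x[i]
        x_star = rng.choice(x, size=n, p=t)     # IID sample of size n from F
        values[j] = statistic(x_star)
    return values

rng = np.random.default_rng(3)
x = rng.standard_normal(51)                     # hypothetical observed data
bb_medians = bayesian_bootstrap(x, np.median, m=2000, rng=rng)

# Every simulated median is one of the observed values: the flaw noted above.
only_observed = bool(np.isin(bb_medians, x).all())
print(only_observed, bb_medians.mean())
```

Since the simulated samples never contain a value outside the original data, the final check makes concrete the limitation that this procedure shares with the nonparametric bootstrap.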






Example 5.84 (Continuation of Example 5.81; see page 331). To be as
nonparametric as possible, suppose that we model the data $\{X_n\}_{n=1}^\infty$ as IID with
distribution $P$ conditional on $\mathbf{P} = P$ and we give $\mathbf{P}$ a tailfree process prior (see
Section 1.6.2). We now observe $X_1 = x_1, \ldots, X_n = x_n$. Suppose that we are
interested in the mean of $R(X', F) = (Y_{1/2} - F^{-1}(1/2))^2$, where $X'$ is a future
sample of size $n$ from $P$, $F$ is the CDF for $P$, and $Y_{1/2}$ is the median of the
observed sample. We could simulate a collection of distributions $P_1, \ldots, P_m$ from
the posterior distribution of $\mathbf{P}$. For each $P_j$ (having CDF $F_j$), we could simulate
an IID sample $X^{*j}$ of size $n$ and find the sample median $Y^{*j}_{1/2}$. We also would need
to find the median of $P_j$ (call it $F_j^{-1}(1/2)$) and then average $(Y^{*j}_{1/2} - F_j^{-1}(1/2))^2$.

As an example, we might use the Bayesian bootstrap, which corresponds to a
tailfree prior with improper prior distribution as well as to a Dirichlet process
with improper prior. To simulate $F$, we need only simulate a vector $T$ with
$Dir_n(1, \ldots, 1)$ distribution, and let $F(x)$ be the sum of $T_i$ for those $i$ such that
$x_i \le x$. Then, the exact distribution of the median of a sample of size $n$ drawn
from the distribution $F$ can be computed as in (5.82). For instance, for
$k = 2, \ldots, n - 1$ and odd $n$, we have

$$\Pr(Y^*_{1/2} = X_{(k)} \mid T) = \sum_{\ell=0}^{(n-1)/2} \; \sum_{q=(n+1)/2-\ell}^{n-\ell} \binom{n}{\ell,\ q,\ n-\ell-q} T_{<,k}^{\ell}\, T_k^{q}\, T_{>,k}^{\,n-\ell-q}, \tag{5.85}$$

where $T_{<,k} = \sum_{i=1}^{k-1} T_i$ and $T_{>,k} = \sum_{i=k+1}^{n} T_i$. Using the same data sets as in
Example 5.81 on page 331, we obtained the results summarized in Table 5.86.
The Bayesian bootstrap seems to estimate $ER(X, F)$ to be even higher than does
the bootstrap. This seems natural due to the additional variance introduced in
the Bayesian bootstrap.

The particular choice of R(X, F) in Example 5.84 was chosen to match 
the choice from Example 5.81. From a Bayesian viewpoint, however, a more 
interesting use of the bootstrap technology might be to try to predict the 
median of a future sample. In this case, we could use the bootstrap (or 
Bayesian bootstrap) distribution of the median of the sample X* as a pre- 
dictive distribution for the median of a future sample. In fact, the Bayesian 
bootstrap distribution of the median of the sample X* is precisely the pre- 
dictive distribution of the median of a future sample if F is modeled as 
a Dirichlet process with improper prior. Similarly, following the bootstrap 
logic, if $F_n$ is sufficiently like $F$, then the median of a sample drawn from



Table 5.86. Summary of Bayesian Bootstrap Results for Median

Data Type   Sample Size   ER(X, F)   Bootstrap Average   RMS Error
Laplace          51         1.242          1.988            1.151
Laplace         101         1.162          1.638            0.735
Normal           51         1.551          2.183            1.117
Normal          101         1.561          1.916            0.864
Uniform          51         0.241          0.308            0.153
Uniform         101         0.245          0.306            0.121







Figure 5.87. Bootstrap and Bayesian Bootstrap Distributions of Sample Median
[Plot omitted; horizontal axis: True Distribution, from 0.0 to 1.0.]



$F_n$ should have a distribution like that of the median of a sample drawn
from $F$.

Example 5.88 (Continuation of Example 5.84; see page 333). We can use the
bootstrap distribution of the median in (5.82) or the mean of the conditional
Bayesian bootstrap distribution given $T$ in (5.85) as predictive distributions for
the medians of future samples. The mean of (5.85) is easy to compute in closed
form since $(T_{<,k}, T_k, T_{>,k})$ has $Dir_3(k - 1, 1, n - k)$ distribution. To see how
closely these distributions approximate the distribution of the median of a future
sample, we note that (for odd $n$) if data $X$ comes from a distribution $F$, then
the median has the same distribution as $F^{-1}(U)$, where $U \sim Beta([n+1]/2, [n+1]/2)$.
Hence, we will only simulate data from the $U(0, 1)$ distribution in order to
compare the two bootstrap distributions to the true distribution of the sample
median. We will do this by calculating (for $k = 1, \ldots, n$) the probability that the
median of a future sample lies below $X_{(k)}$ and comparing this to (5.82) and the
mean of (5.85). Figure 5.87 shows plots of the bootstrap and Bayesian bootstrap
distributions of a future median against the true distribution for sample sizes
of 51 and 101. The bootstrap consistently assigns too little probability to both
the upper and lower tails of the distribution. The Bayesian bootstrap, however,
reproduces the true distribution (the diagonal line in Figure 5.87) remarkably well
due to the additional variance it adds to the predictive distribution compared to
the bootstrap.
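
The mean of the conditional Bayesian bootstrap distribution can be evaluated exactly from the moments of the Dirichlet distribution. The Python sketch below assumes odd $n$ and distinct observations, and uses the standard identity that a $Dir_3(\alpha_1, \alpha_2, \alpha_3)$ vector has $E[T_1^{a} T_2^{b} T_3^{c}]$ equal to a ratio of rising factorials. A zero parameter, which occurs for $k = 1$ and $k = n$, is treated as a component fixed at 0. The function names are hypothetical, and the ordinary bootstrap probabilities are computed from binomial tails for comparison.

```python
from fractions import Fraction
from math import comb

def rising(a, j):
    """Rising factorial a (a+1) ... (a+j-1); equals 1 when j = 0."""
    out = 1
    for i in range(j):
        out *= a + i
    return out

def bayes_boot_mean_pmf(n, k):
    """Mean over T of Pr(median = X_(k) | T), (T_<, T_k, T_>) ~ Dir_3(k-1, 1, n-k)."""
    half = (n - 1) // 2
    total = 0
    for l in range(half + 1):                      # draws strictly below X_(k)
        for q in range(half + 1 - l, n - l + 1):   # draws equal to X_(k)
            m = n - l - q                          # draws strictly above X_(k)
            ways = comb(n, l) * comb(n - l, q)     # multinomial coefficient
            # Numerator of the Dirichlet moment E[T_<^l T_k^q T_>^m].
            total += ways * rising(k - 1, l) * rising(1, q) * rising(n - k, m)
    return Fraction(total, rising(n, n))           # common denominator of the moments

def bootstrap_pmf(n, k):
    """Ordinary bootstrap Pr(median = X_(k)) from binomial tail probabilities."""
    half = (n - 1) // 2
    def cdf(j):
        return Fraction(sum(comb(n, i) * j ** i * (n - j) ** (n - i)
                            for i in range(half + 1, n + 1)), n ** n)
    return cdf(k) - cdf(k - 1)

n = 11
bb = [bayes_boot_mean_pmf(n, k) for k in range(1, n + 1)]
bs = [bootstrap_pmf(n, k) for k in range(1, n + 1)]
print(float(bb[0]), float(bs[0]))   # probability placed on the smallest observation
```

For $n = 11$ the Bayesian bootstrap places noticeably more probability on the extreme order statistics than the ordinary bootstrap does, in line with the comparison of tail probabilities made in the example.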

One might be tempted to use tailfree priors or the Bayesian bootstrap
in the second part of Example 5.80 to try to overcome the problems the
nonparametric bootstrap had.¹⁴ It is clear that the Bayesian bootstrap
will fare no better than the nonparametric bootstrap for nearly the same



14 This example is examined in detail by Schervish (1994). 






reason. Trying to use a tailfree process (like a Polya tree distribution)
quickly leads to the realization that the problem as described cannot be
solved satisfactorily without further modeling assumptions. For example,
the typical $P$ with Polya tree distribution on an interval $[0, \theta]$ will, with
high posterior probability, have a density very close to zero on an interval
$[a, \theta]$ if all of the observed data are less than or equal to $a$. The distribution
of $n(\Theta - X^*_{(n)})/\Theta$ is likely to be concentrated on very large values because
the probability that $X^*_{(n)} < a$ will be quite high. This suggests that one
might wish to restrict attention to distributions whose densities stay above
a certain level near $\theta$. Alternatively, one might wish to replace $F^{-1}(1)$ with
$F^{-1}(1 - \epsilon)$ for some small $\epsilon$. There are any number of possible alternative
formulations of the problem. One should give serious consideration to what
one really wants to know before choosing a procedure that may not solve
the problem of interest.

5.3.2 Standard Deviations and Bias 

The bootstrap was originally [see Efron (1979)] designed as a tool for esti- 
mating the bias and standard error of a statistic. 

Example 5.89. Suppose that conditional on $\mathbf{P} = P$, $X_1, \ldots, X_n$ are IID
with CDF $F$, where $F$ is the CDF of distribution $P$. We assume only
that $\int x^2\,dF(x) < \infty$. Let

$$R(X, F) = \bar{X}_n^2 - \left(\int x\,dF(x)\right)^2,$$

and suppose that we are interested in the mean of $R$. This is the bias of the
square of the sample average as an estimator of the square of the mean. If the
observed sample average is $\bar{x}_n$ and we use $s_n^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2/n$ as an estimate
of variance, then

$$R(X^*, F_n) = (\bar{X}^*_n)^2 - \bar{x}_n^2.$$

The mean of this, given the data, is $s_n^2/n$. The mean of $R(X, F)$ given $\mathbf{P} = P$
is $\sigma^2/n$, where $\sigma^2 = \int (x - \mu)^2\,dF(x)$. Since $s_n^2$ is supposed to be close to $\sigma^2$ for
large $n$, the bootstrap is thought to behave well in this case. If, instead, we had
used



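$$R'(X, F) = \frac{\bar{X}_n^2 - \left(\int x\,dF(x)\right)^2}{\int (x - \mu)^2\,dF(x)},$$

then the mean of $R'(X, F)$ would be $1/n$ no matter what $F$ is.
How would one use the fact that $R'(X, F)$ has mean $1/n$ to "correct" $\bar{X}_n^2$ as
an estimator of $\left(\int x\,dF(x)\right)^2$? Presumably, one would subtract $s_n^2/n$. Although
this would usually do well, if $s_n^2/n$ happens to be larger than $\bar{X}_n^2$, one would get
a very silly result.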
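
The claim that the bootstrap mean of $(\bar{X}^*_n)^2 - \bar{x}_n^2$ is $s_n^2/n$ can be checked by simulation. The Python sketch below does so for one hypothetical data set; the exponential data, the sample size, and the number of resamples are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=40)          # hypothetical observed data
n = len(x)
x_bar = x.mean()
s2_n = np.mean((x - x_bar) ** 2)                 # variance estimate with divisor n

m = 50_000
x_star = x[rng.integers(0, n, size=(m, n))]      # m bootstrap samples of size n
r_star = x_star.mean(axis=1) ** 2 - x_bar ** 2   # squared resample mean minus squared mean
bias_estimate = r_star.mean()                    # simulation estimate of the bias

# Bias-corrected estimate of the squared mean; it can be negative when
# s2_n / n exceeds x_bar ** 2, which is the "silly result" mentioned above.
corrected = x_bar ** 2 - s2_n / n
print(bias_estimate, s2_n / n, corrected)
```

In this case the simulation is unnecessary because the bootstrap mean is available in closed form, but the same code applies unchanged to statistics for which no such formula exists.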









Similarly, if one were interested in the standard deviation of X , one could 
assume that $\int x^4\,dF(x) < \infty$, and apply the bootstrap to

dF(xi)...dF(z n )^ • (5.90) 

The bootstrap estimate of standard deviation would be the square root of the
average of the values of $R(X^*, F_n)$. Unfortunately, the term after the minus sign
in (5.90) is not easy to evaluate even with $F = F_n$. An obvious alternative is to
use the average of the values of $\left(\sum_{i=1}^n X_i^*/n\right)^2$.

The bootstrap can be applied to all types of statistics whose means 
and/or variances are difficult to calculate analytically. Efron and Tibshirani 
(1993) give many examples of such statistics, like correlations, regression 
coefficients, and nonlinear functions of such things, whose sampling distri- 
butions are nontrivial but whose bootstrap distributions are very straight- 
forward. (See some of the problems at the end of this chapter.) 






5.3.3 Bootstrap Confidence Intervals 

Suppose that we desire a confidence interval for some function $g$ of a
parameter $\Theta$. From the bootstrap perspective, it is preferable to write
$g(\Theta)$ as $h(F)$. We might then desire a confidence interval of the form
$(-\infty, h(F_n) + Y]$ or $[h(F_n) - Y_1, h(F_n) + Y_2]$. The problem is to find $Y$, or
$Y_1$ and $Y_2$. For the one-sided case, in order for the interval to be a
coefficient $\gamma$ confidence interval, it must be that $P_\theta(h(F_n) + Y \ge h(F)) = \gamma$.
Equivalently, we need $P_\theta(h(F) - h(F_n) \le Y) = \gamma$. There might be available
a formula $\sigma^2(F)$ for the approximate variance of $h(F_n)$. For example,
if $h(F) = \int x\,dF(x)$, then $\sigma^2(F) = \int [x - h(F)]^2\,dF(x)/n$. In this case,
one might replace $Y$ by $\sigma(F)Y'$ or by $\sigma(F_n)Y'$. Suppose that we do the
latter.¹⁵ Then we want $Y'$ to satisfy

$$P_\theta\left(\frac{h(F) - h(F_n)}{\sigma(F_n)} \le Y'\right) = \gamma. \tag{5.91}$$

This makes $Y'$ equal to the $\gamma$ quantile of $R(X, F)$, where $R$ is the function
on the left of the inequality in (5.91). What is commonly called the
percentile-t bootstrap confidence interval for $h(F)$ would be

$$\left(-\infty,\ h(F_n) + \sigma(F_n)\hat{Y}'\right], \tag{5.92}$$



15 The reason for switching from $Y$ to $\sigma(F_n)Y'$ is that one would suspect
$Y'$ depends less on the underlying distribution than does $Y$.






where $\hat{Y}'$ is an estimate of $Y'$ obtained by simulation. One could simulate
$X_1^*, \ldots, X_b^*$ with distribution $F_n$ and calculate, for $i = 1, \ldots, b$,

$$R(X_i^*, F_n) = \frac{h(F_n) - h(F_i^*)}{\sigma(F_i^*)},$$

where $F_i^*$ is the empirical CDF of the bootstrap sample $X_i^*$. The sample
$\gamma$ quantile of the $R(X_i^*, F_n)$ values could then serve for $\hat{Y}'$ in (5.92). Hall
(1992) examines bootstrap confidence intervals in detail and finds asymptotic
expressions for their actual coverage probabilities.
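
For the case in which $h(F)$ is the mean of $F$, the percentile-t construction can be sketched in Python as follows. The sketch takes $\sigma(F)$ to be the standard deviation of $F$ divided by $\sqrt{n}$, so that $\sigma(F_n)$ uses the variance estimate with divisor $n$. The function name, the simulated Laplace data, and the number of bootstrap samples are illustrative assumptions.

```python
import numpy as np

def percentile_t_upper_bound(x, gamma, b, rng):
    """Upper endpoint of a one-sided percentile-t interval for the mean."""
    n = len(x)
    h_n = x.mean()                                  # h(F_n)
    sigma_n = x.std() / np.sqrt(n)                  # sigma(F_n), divisor-n variance
    x_star = x[rng.integers(0, n, size=(b, n))]     # b bootstrap samples from F_n
    h_star = x_star.mean(axis=1)                    # h(F_i*)
    sigma_star = x_star.std(axis=1) / np.sqrt(n)    # sigma(F_i*)
    r = (h_n - h_star) / sigma_star                 # studentized bootstrap values
    y_hat = np.quantile(r, gamma)                   # sample gamma quantile
    return h_n + sigma_n * y_hat

rng = np.random.default_rng(5)
x = rng.laplace(loc=1.0, scale=1.0, size=50)        # hypothetical data set
upper = percentile_t_upper_bound(x, gamma=0.95, b=10_000, rng=rng)
print(x.mean(), upper)
```

A lower bound can be obtained in the same way by reflecting the data about zero, and a two-sided interval by combining an upper and a lower quantile of the studentized values.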

Example 5.93. Consider the same data used in Example 1.132 on page 71. The
data were a sample of size $n = 50$ from a Laplace distribution $Lap(1, 1)$. We are
interested in $h(F) = \int x\,dF(x)$, the mean of $F$. We simulated 10,000 bootstrap
samples, $X_1^*, \ldots, X_{10000}^*$, and for each one, we calculated $R(X_i^*, F_{50})$, where $F_{50}$
is the empirical CDF of the 50 observations. The curve labeled "Bootstrap" in
Figure 5.94 shows the 10,000 values of $\bar{X}_{50} + R(X_i^*, F_{50}) S_{50}$, where $S_{50} = \sigma(F_{50})$,
the MLE of the standard deviation of $\bar{X}_{50}$. As an example of confidence intervals,
one-sided 95% lower and upper bound confidence intervals for $h(F)$ based on the
bootstrap are $[0.5735, \infty)$ and $(-\infty, 1.3199]$, respectively.

Using the same prior distribution as in Example 1.132, we also used a tailfree
process of Polya tree type to simulate 10,000 values of the mean of the
distribution $P$. The empirical CDF of these values is plotted as the curve labeled
"Tailfree" in Figure 5.94. Not surprisingly, we see that the extreme quantiles of
the tailfree sample are farther from the sample mean than those of the bootstrap
sample. This is due to the fact that the bootstrap procedure ignores the
uncertainty from not knowing $F$ when calculating $R$ values. That is, we must pretend




Figure 5.94. Distributions of Sampled Bootstrap and Tailfree Quantiles
[Plot omitted; horizontal axis runs from -1 to 3.]






that $F = F_{50}$ when calculating the $R$ values. The Bayesian solution takes into
account the additional uncertainty from not knowing $F$. The 0.95 upper and lower
bound posterior probability intervals corresponding to the confidence intervals
calculated above are $[0.2574, \infty)$ and $(-\infty, 1.7628]$, respectively.

Hall (1992) shows that the percentile-t confidence interval has good fre- 
quency properties in terms of the conditional probability that the interval 
covers h(F) given F. With a single sample, as in Example 5.93, frequency 
properties are neither apparent nor relevant. Suppose, however, that one 
were to use bootstrap confidence intervals in many applications (with many 
different $F$s). From the classical perspective, one might actually be interested
in the proportion of times that the interval covers $h(F)$ and how this
compares to the nominal confidence coefficient. 

Example 5.95. Suppose that we will sample data from several different $Lap(\mu, \sigma)$
distributions on several different occasions but we do not model the data this way.
Rather, suppose that we use the bootstrap and a tailfree prior of Polya tree type
as in Example 5.93 on page 337. As an example, 1000 data sets of size 50 each
were simulated with many different Laplace distributions. The values of $\sigma$ were
generated as $\Gamma^{-1}(1, 1)$ random variables and the locations were $\sigma$ times $N(0, 1)$
random variables. Location and scale changes do not affect the calculation of $R$
in the bootstrap, but they do affect the variance of $\bar{X}$ and $S$. For each data set,
1000 bootstrap samples were formed and 1000 observations from the posterior
distribution of the mean $\int x\,dP(x)$ were simulated. We counted how many times $\mu$ was below
each of the 1000 sample quantiles of the simulated values. Figure 5.96 shows
these proportions for both the bootstrap and Polya tree samples. As expected,
the bootstrap proportions match the nominal significance levels well.
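
A scaled-down version of the bootstrap half of this experiment can be sketched in Python. The sketch generates Laplace data sets with random location and scale as described above, forms a one-sided percentile-t upper bound for the mean from each, and records how often the bound covers the true mean. The numbers of data sets and bootstrap samples are reduced from 1000 to 400 to keep the run short, only a single nominal level of 0.95 is examined, and the function name is hypothetical. The Polya tree half of the experiment is not attempted.

```python
import numpy as np

def percentile_t_upper_bound(x, gamma, b, rng):
    """Upper endpoint of a one-sided percentile-t interval for the mean."""
    n = len(x)
    h_n = x.mean()
    sigma_n = x.std() / np.sqrt(n)
    x_star = x[rng.integers(0, n, size=(b, n))]
    r = (h_n - x_star.mean(axis=1)) / (x_star.std(axis=1) / np.sqrt(n))
    return h_n + sigma_n * np.quantile(r, gamma)

rng = np.random.default_rng(6)
gamma, n, data_sets, b = 0.95, 50, 400, 400
covered = 0
for _ in range(data_sets):
    sigma = 1.0 / rng.gamma(shape=1.0, scale=1.0)   # inverse gamma(1, 1) scale
    mu = sigma * rng.standard_normal()              # location: sigma times N(0, 1)
    x = rng.laplace(loc=mu, scale=sigma, size=n)
    upper = percentile_t_upper_bound(x, gamma, b, rng)
    covered += int(upper >= mu)                     # does the interval cover mu?

coverage = covered / data_sets
print(coverage)
```

With only 400 data sets the observed proportion is subject to simulation error of roughly one percentage point, so it should be read as a rough check against the nominal level rather than a precise estimate.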




5.4. Problems 339 



One final note is in order concerning the bootstrap. All you ever learn 
about by using the bootstrap, without further modeling assumptions, are 
properties of $F_n$. Unless you have a way of saying how much and/or in
what ways knowledge of $F_n$ can be transformed into knowledge of $P$, the
bootstrap can only tell you about $F_n$, not about $P$.



5.4 Problems 

Section 5.1.1: 

1. Prove Proposition 5.8 on page 298. 

2. Let $X_1, X_2, X_3$ be IID given $\Theta = \theta$ with $Exp(\theta)$ distribution (mean $= 1/\theta$).
Find the UMVUE of $g(\theta) = 1 - \exp(-x\theta)$. (Hint: Use the Rao-Blackwell
theorem 3.22.)

3. Suppose that $X_1, \ldots, X_n$ are IID given $\Theta = \theta$ with $Exp(1/\theta)$ distribution.
Find the UMVUE of $\Theta$ and show that it is inadmissible with squared-error
loss.

4. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$.

(a) Find the UMVUE of $\Theta^2$. What is wrong with this estimator?

(b) Suppose that we have a decision problem with loss function $L(\theta, a) = (\theta^2 - a)^2$.
Find the generalized Bayes rule with respect to Lebesgue
measure. Show that this estimator is inadmissible.

5. Suppose that $X_1, \ldots, X_n$ are conditionally IID with $Ber(\theta)$ distribution
given $\Theta = \theta$. Find the UMVUE of $\Theta(1 - \Theta)$.

6. Suppose that $X_1, \ldots, X_{20}$ are conditionally IID given $\Theta = \theta$ with $N(\theta, 1)$
distribution. We first collect $X_1, \ldots, X_{10}$ and compute $Y = \sum_{i=1}^{10} X_i/10$.
If $Y < 5$, we set $Z = Y$, $N = 10$ and stop sampling. If $Y > 5$, we collect the
other 10 observations and compute $Z = \sum_{i=1}^{20} X_i/20$. We then set $N = 20$.
The data we report are $(N, Y, Z)$.

(a) Prove that $Y$ is an unbiased estimator of $\Theta$.

(b) Prove that $Z$ is biased and find its bias.

(c) Show that $N$ is not ancillary.

7. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be pairs of random variables, and let
$\Theta = (\Lambda, M)$. Suppose that

$$f_{X_i, Y_i|\Theta}(x, y|\lambda, \mu) = \lambda\mu \exp(-\lambda x - \mu y), \quad \text{for } x > 0,\ y > 0.$$

All we get to observe, for $i = 1, \ldots, n$, are

(a) Prove that $Z_i$ is conditionally independent of $U_i$ given $\Theta$.






(b) Find a complete sufficient statistic. 

(c) Find a UMVUE of A. 

8. Return to the situation of Problem 16 on page 140. Find the UMVUE of
$\Theta$.

9. Return to the situation of Problem 9 on page 138.

(a) Find the conditional UMVUE of $\Theta$ given $X_1$.

(b) Find an estimator with smaller unconditional variance than the
conditional UMVUE.

(c) Show that $(N, M)$ is not a complete sufficient statistic.

10. Let $X_1, \ldots, X_n$ be conditionally IID with $N(\theta, 1)$ distribution given $\Theta = \theta$.
Determine the UMVUE of $g(\theta) = \Pr(|X_1| < c \mid \theta)$, where $c > 0$ is fixed.

11. Suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta = \theta$ with
conditional density

$$f_{X|\Theta}(x|\theta) = \frac{\theta a^{\theta}}{x^{\theta+1}} I_{[a,\infty)}(x),$$

where $a$ is known and the parameter space is $\Omega = (0, \infty)$.

(a) Find a complete sufficient statistic.

(b) Find a UMVUE of $\Theta$.

(c) Prove that the UMVUE is inadmissible if the loss function is squared
error.

12. Let $\Omega$ be the set of all integers, and suppose that $X$ given $\Theta = \theta$ has
discrete uniform distribution on the set $\{\theta - 1, \theta, \theta + 1\}$. Let $g : \Omega \to \mathbb{R}$ be
a nonconstant function. Show that there is no UMVUE of $g(\Theta)$.

13. An alternative mode of estimation is the method of moments. Let
$X = (X_1, \ldots, X_n)$ with the $X_i$ being IID given $\Theta$. Suppose that $\mu_k = E_\theta(X_i^k)$
is finite for $k = 1, \ldots, m$. Also suppose that there is a function $h$ such
that $g(\theta) = h(\mu_1, \ldots, \mu_m)$. Let $Y_k = \sum_{i=1}^n X_i^k/n$. Then $h(Y_1, \ldots, Y_m)$ is a
method of moments estimator of $g(\Theta)$. Find method of moments estimators
for each of the following situations:

(a) $X_i \sim Exp(\theta)$ given $\Theta = \theta$, $g(\theta) = \theta$;

(b) $X_i \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$, $g(\theta) = \sigma^2$;

(c) $X_i \sim Ber(\theta)$ given $\Theta = \theta$, $g(\theta) = \theta(1 - \theta)$.

Section 5.1.2: 

14. Suppose that $P_\theta$ says that $X \sim Poi(\theta)$ and that we are trying to estimate
$\exp(-3\theta)$.

(a) Find the Cramer-Rao lower bound for unbiased estimators.

(b) If $\phi(X) = (-2)^X$, find $\mathrm{Var}_\theta\,\phi(X)$.

(c) Find both the Cramer-Rao lower bound and the Chapman-Robbins
lower bound for the variance of unbiased estimators of $\Theta$. Which is
larger?






15. Suppose that $X_1, \ldots, X_n$ are conditionally IID $Poi(\theta)$ given $\Theta = \theta$.

(a) Let $r$ be a known integer. Find the UMVUE of $\exp(-\Theta)\Theta^r$.

(b) Let $n = 1$ in part (a). Find the variance of the estimator and the
Cramer-Rao lower bound.

16. Suppose that $Y$ has $Exp(1)$ distribution and is independent of $\Theta$. Suppose
that $X_1, \ldots, X_n$ are IID conditional on $Y = y$ and $\Theta = \theta$ with $N(\theta, 1/y)$
distribution. We get to observe $Y$ and $X_1, \ldots, X_n$.

(a) Find the Cramer-Rao lower bound for the variance of unbiased
estimators of $\Theta$.

(b) Show that no unbiased estimator achieves the Cramer-Rao lower
bound.

(c) Explain why the Cramer-Rao lower bound should not be taken
seriously in the problem described here.

17. For the location family of $t_a$ distributions in Example 5.17 on page 303,
prove that the Bhattacharyya lower bound with $k = 2$ is the same as the
Cramer-Rao lower bound.

18. Let $X \sim U(0, \theta)$ given $\Theta = \theta$.

(a) Find the Chapman-Robbins lower bound on the variance of an
unbiased estimator of $\Theta$.

(b) Find an unbiased estimator and find by how much its variance exceeds
the lower bound.

19. In Example 5.19 on page 304, prove that $\min X_i - 1/n$ is the UMVUE.

20. Refer to the situation in Example 2.83 on page 112.

(a) Prove that $\mathrm{Cov}_\theta(\psi_1, \psi_2) = 0$.

(b) Prove that $\mathrm{Var}_\theta(X^2) = 2\sigma^4 + 4\mu^2\sigma^2$.

Section 5.1.3:

21. Consider the situation in Problem 6 on page 339. Find the MLE of $\Theta$.

22. Find the maximum likelihood estimator of $\Theta$ if $X_1, \ldots, X_n$ are
conditionally IID with distribution $Exp(\theta)$ given $\Theta = \theta$.

23. Suppose that $X \sim Cau(\theta, 1)$ given $\Theta = \theta$. Find the MLE of $\Theta$.

24. Suppose that $X_1, \ldots, X_n$ are IID with Laplace distribution $Lap(\mu, \sigma)$ given
$\Theta = (\mu, \sigma)$.

(a) Prove that the value of $\mu$ that minimizes $\sum_{i=1}^n |x_i - \mu|$ is the median
of the numbers $x_1, \ldots, x_n$.

(b) Find the MLE of $\Theta$.

25. Let $X$ have a one-parameter exponential family distribution given $\Theta$.
Suppose that the MLE of $\Theta$ is interior to the parameter space. Prove that the
MLE equals a method of moments estimator. (See Problem 13 on page 340.)






26. Consider the situation in Example 5.30 on page 308. Let squared error be
the loss, that is, $L(\theta, a) = (\mu^2 - a)^2$. Show that the UMVUE dominates
the MLE if $n > 2$. Find a formula for the difference in the risk functions.
Also, find an estimator that dominates both the MLE and the UMVUE.

Section 5.1.4: 



27. Consider the situation in Problem 6 on page 339. Find the likelihood func- 
tion and show that a Bayesian would take the data at face value (that is, a 
Bayesian would calculate the same posterior as if the sample size had been 
fixed in advance at whatever value N turns out to be, no matter what the 
prior is). 

28. Consider the situation in Problem 7 on page 339. If $\Lambda$ and $M$ are
independent a priori with $\Lambda \sim \Gamma(a, b)$ and $M \sim \Gamma(c, d)$, find the posterior
mean of $\Lambda$ given the data.

29. In Example 5.10 on page 299, find the posterior mean of $\Theta$ given $X = x$
for all $x$ assuming that the prior for $\Theta$ is $Beta(\alpha_0, \beta_0)$.

30. Return to the situation of Problem 16 on page 140. If $\Theta$ has a prior density
$f_\Theta(\theta) = \alpha c^\alpha I_{[c,\infty)}(\theta)/\theta^{\alpha+1}$, find the posterior mean of $\Theta$.

31. *Suppose that, conditional on $N$, $\{X_i\}_{i=1}^\infty$ are independent with the first
$N$ of them having $Ber(1/3)$ distribution and the rest having $Ber(2/3)$
distribution. The prior for $N$ is $f_N(n) = 2^{-n}$ for $n = 1, 2, \ldots$.

(a) Find the posterior distribution of $N$ given a finite sample $X_1, \ldots, X_n$,
for known $n$.

(b) If $X_1 = 0, \ldots, X_n = 0$ is the observed finite sample, find the posterior
mean of $N$.



Section 5.1.5: 



32. Let $\mathcal{P}_0$ be the set of distributions on $(\mathbb{R}, \mathcal{B}^1)$ with finite variance. Let $T(P)$
be the standard deviation of the distribution $P$. Show that $IF(x; T, P) =
(x - \mu)^2/[2\sigma] - \sigma/2$, where $\mu$ is the mean of $P$ and $\sigma = T(P)$.

33. Let $\mathcal{P}_0$ be the class of distributions on $(\mathbb{R}, \mathcal{B}^1)$ with bounded support, and
let $T(P)$ be the supremum of the support. Prove that the influence function
for $T$ is $IF(x; T, P) = 0$ if $x < T(P)$ and $\infty$ if $x > T(P)$.

34. Find the influence function for the $100\alpha\%$ trimmed mean at a continuous
distribution $P$.



Section 5.2: 



35. Prove Proposition 5.48 on page 316. 

36. Let the parameter space be $\Omega$ with $\sigma$-field of subsets $\tau$. Let $X$ be a random
quantity taking values in a set $\mathcal{X}$, and let $X$ have conditional density $f_{X|\Theta}$
given $\Theta$. Let $v : \mathcal{X} \to [0, \infty)$ be a measurable function such that the set

$$L(x) = \{\theta \in \Omega : f_{X|\Theta}(x|\theta) > v(x)\}$$






is in $\tau$ for all $x$. Let $\Theta$ have a prior distribution $\mu_\Theta$, and let $\mu_X$ denote the
prior predictive distribution of $X$. Let $C : \mathcal{X} \to \tau$ be another set function
such that $\mu_\Theta(C(x)) \le \mu_\Theta(L(x))$, a.s. $[\mu_X]$. Prove that
$\Pr(\Theta \in C(x) \mid X = x) \le \Pr(\Theta \in L(x) \mid X = x)$, a.s. $[\mu_X]$.

37. Prove Proposition 5.56 on page 319. 

38. Prove Proposition 5.61 on page 321. 

39. Prove Proposition 5.79 on page 329. 

40. Suppose that $\Omega = \mathbb{R}$ and that the posterior density of $\Theta$ given $X$ is strongly
unimodal. Let the action space be the set of all closed and bounded intervals
$[a_1, a_2]$ in $\mathbb{R}$.

(a) Let the loss function be $L_l(\theta, [a_1, a_2]) = a_2 - a_1 + c\,(1 - I_{[a_1,a_2]}(\theta))$.
Prove that the formal Bayes rule is an HPD region.

(b) Let the loss function be $L_q(\theta, [a_1, a_2]) = (a_2 - a_1)^2 + c\,(1 - I_{[a_1,a_2]}(\theta))$.
Find the formal Bayes rule.

Section 5.3: 

41. Suppose that one wished to construct a parametric bootstrap estimate in 
the second part of Example 5.80 on page 330. 

(a) Explain how to construct the parametric bootstrap estimate using
the $U(0, \theta)$ parametric family.

(b) Find the distribution of $R(X^*, F_n)$ for the parametric bootstrap
estimate.

(c) Will the parametric bootstrap estimate have the same problem that 
the nonparametric bootstrap estimate has? 

42. How would one use the nonparametric bootstrap to find the bias and
standard deviation for the sample correlation coefficient from a sample of $n$
pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$?

43. Let $(x_1, Y_1), \ldots, (x_n, Y_n)$ be data pairs, and suppose that we entertain a
regression model in which $E(Y_i \mid B_0 = \beta_0, B_1 = \beta_1) = \beta_0 + \beta_1 x_i$. The
$x$-intercept of the regression line is $x = -\beta_0/\beta_1$. Let $(\hat{B}_0, \hat{B}_1)$ be the usual
least-squares regression estimator.

(a) How would one use the bootstrap to find the bias and standard
deviation of the ratio $-\hat{B}_0/\hat{B}_1$?

(b) Suppose that one used the following formula for the approximate
variance of the ratio of two random variables $Z_0/Z_1$:

$$\frac{\mathrm{Var}(Z_0)}{Z_1^2} + \frac{Z_0^2\,\mathrm{Var}(Z_1)}{Z_1^4} - 2\,\frac{Z_0\,\mathrm{Cov}(Z_0, Z_1)}{Z_1^3}.$$

Show how you would use this to find bootstrap confidence intervals
for $-\hat{B}_0/\hat{B}_1$.



Chapter 6 
Equivariance* 



In Chapter 3, we introduced a few principles of classical decision theory 
(e.g., minimaxity, ancillarity) to help to choose among admissible rules. 
In Chapter 5, we introduced another principle called unbiasedness which 
could be used to select a subset of the class of all estimators. As we saw, 
sometimes none of the unbiased estimators was admissible. In this chapter, 
we introduce another ad hoc principle called equivariance, 1 which can also 
be used to select a subset of the class of all estimators. The principle of 
equivariance, in its most general form, relies on the algebraic theory of 
groups. However, the basic concept can be understood by means of a simple 
class of problems in which the principle can apply. 



6.1 Common Examples 
6.1.1 Location Problems 

We will consider only parametric aspects of equivariance, since it is of
interest primarily in the classical paradigm. Suppose that we have constructed
a parametric family $\mathcal{P}_0$ with a parameter $\Theta$.

Definition 6.1. First, let $X$ and $\Theta$ be scalar random variables. If the
conditional distribution of $X - \Theta$ given $\Theta = \theta$ is the same for all $\theta$, then



*This chapter may be skipped without interrupting the flow of ideas.

¹Some authors call the principle invariance rather than equivariance. We will
see later why the term invariance is better used to mean something different, but
related.






$\Theta$ is called a location parameter for $X$. If $\Theta > 0$, and the conditional
distribution of $X/\Theta$ given $\Theta = \theta$ is the same for all $\theta$, then $\Theta$ is called a
scale parameter for $X$. If $\Theta = (\Theta_1, \Theta_2)$, where both $\Theta_i$ are scalar, $\Theta_2 > 0$,
and the conditional distribution of $(X - \Theta_1)/\Theta_2$ given $\Theta = \theta$ is the same
for all $\theta$, then $\Theta$ is called a location-scale parameter for $X$.

Next, let $X$ be a vector, and let $\Theta$ be a scalar. Let $1$ denote the vector
of the same length as $X$ with every coordinate equal to 1. Then $\Theta$ is a
location parameter for $X$ if the conditional distribution of $X - \Theta 1$ given
$\Theta = \theta$ is the same for all $\theta$. If $\Theta > 0$ and the conditional distribution of
$X/\Theta$ given $\Theta = \theta$ is the same for all $\theta$, then $\Theta$ is a scale parameter for $X$.
If $\Theta = (\Theta_1, \Theta_2)$, where both $\Theta_i$ are scalar, $\Theta_2 > 0$, and the conditional
distribution of $(X - \Theta_1 1)/\Theta_2$ given $\Theta = \theta$ is the same for all $\theta$, then $\Theta$ is
called a location-scale parameter for $X$.

Next, let $X$ be a vector, and let $\Theta$ be a vector of the same dimension.
Then $\Theta$ is a location parameter for $X$ if the conditional distribution of
$X - \Theta$ given $\Theta = \theta$ is the same for all $\theta$. If $\Theta$ is a nonsingular matrix
parameter and the conditional distribution of $\Theta^{-1} X$ given $\Theta = \theta$ is the
same for all $\theta$, then $\Theta$ is a scale parameter for $X$. If $\Theta = (\Theta_1, \Theta_2)$, where
$\Theta_1$ is a vector of the same dimension as $X$, $\Theta_2$ is a nonsingular matrix, and
the conditional distribution of $\Theta_2^{-1}(X - \Theta_1)$ given $\Theta = \theta$ is the same for
all $\theta$, then $\Theta$ is called a location-scale parameter.

We will deal only with location parameters in this section. In fact, the
only cases of location parameters we will consider are those in which $X$ is
a vector of exchangeable random variables that are conditionally IID given
$\Theta$, and $\Theta$ is scalar. That is, the conditional distribution of $X - \Theta 1$ given
$\Theta = \theta$ is the same for all $\theta$.

Theorem 6.2. If $\Theta$ is a location parameter for $X$ and $f_{X|\Theta}(x|\theta)$ is the
Radon-Nikodym derivative of $P_\theta$ with respect to Lebesgue measure, then
$f_{X|\Theta}(x|\theta) = g(x - \theta 1)$ for some density function $g$.

Proof. The conditional joint CDF of $X - \Theta 1$ given $\Theta = \theta$ is

$$P_\theta(X_i - \theta \le c_i, \text{ for } i = 1, \ldots, n)
= \int_{-\infty}^{c_1+\theta} \cdots \int_{-\infty}^{c_n+\theta} f_{X|\Theta}(x|\theta)\, dx_n \cdots dx_1
= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_n} f_{X|\Theta}(y + \theta 1|\theta)\, dy_n \cdots dy_1,$$

which, for each $c_1, \ldots, c_n$, is the same for all $\theta$ if and only if
$f_{X|\Theta}(y + \theta 1|\theta) = g(y)$ for some density $g$. This implies that
$f_{X|\Theta}(x|\theta) = g(x - \theta 1)$. □

Next, imagine two possible data vectors $x$ and $y = x + c1$. If $\theta + c \in \Omega$,
then $f_{X|\Theta}(x|\theta) = f_{X|\Theta}(y|\theta + c)$. This says that if the data values were all
shifted by the same amount, then the likelihood function would be
translated by that same amount. The goal of equivariance is to take advantage
of this "double shift." We make this idea more precise in a proposition.






Proposition 6.3. If $\Theta$ is a location parameter for $X$, then the conditional
distribution of $X + c1$ given $\Theta = \theta$ is the same as the conditional distribution
of $X$ given $\Theta = \theta + c$.

The word "equivariant" means that two (or more) things change in the
same way. Proposition 6.3 says that two different changes to a problem
produce the same change to a distribution. That is, the conditional
distribution of $X$ given $\Theta$ changes the same way whether we change $X$ to $X + c1$
or we change $\theta$ to $\theta + c$.
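The "double shift" can be checked numerically. The following is a minimal sketch, assuming the IID $N(\theta, 1)$ location family as a concrete instance; the helper names `normal_location_density` and `shift` are illustrative and not part of the text. It evaluates the joint density at $(x, \theta)$ and at $(x + c1, \theta + c)$.

```python
import math

def normal_location_density(x, theta):
    """Joint density of IID N(theta, 1) observations x = (x_1, ..., x_n)."""
    return math.prod(
        math.exp(-0.5 * (xi - theta) ** 2) / math.sqrt(2 * math.pi) for xi in x
    )

def shift(x, c):
    """Return x + c*1, that is, add c to every coordinate of x."""
    return [xi + c for xi in x]

x = [0.3, -1.2, 2.5]
theta, c = 0.7, 1.9

lhs = normal_location_density(x, theta)                   # f(x | theta)
rhs = normal_location_density(shift(x, c), theta + c)     # f(x + c1 | theta + c)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Any other location family could be substituted for the normal density here; only the dependence on $x - \theta 1$ matters.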

Definition 6.4. Consider a decision problem with parameter space $\mathbb{R}$ and
action space $\mathbb{R}$. A loss function $L$ is called location invariant if $L(\theta, a) =
\rho(\theta - a)$, where $\rho$ is some function. If $\rho$ increases as its argument moves
away from 0, such a decision problem is called location estimation.

A decision rule $\delta(x)$ is location equivariant if $\delta(x + c1) = \delta(x) + c$, for all
$c$ and all $x$. A function $g(x)$ is location invariant if $g(x + c1) = g(x)$, for all
$c$ and all $x$.

The word "invariant" means "does not change." Functions that satisfy
$g(x + c1) = g(x)$ have the property that their value does not change when
their argument changes. Functions that satisfy $\delta(x + c1) = \delta(x) + c$ have
the property that their value changes the same way whether we change $x$
to $x + c1$ or we change $\delta(\cdot)$ to $\delta(\cdot) + c$.
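As a small illustration of the two definitions, the sketch below checks that the sample mean and the sample median are location equivariant and that the vector of differences from the last coordinate is location invariant. The data values and the helper names `shift` and `differences` are assumptions chosen for the example.

```python
import statistics

def shift(x, c):
    """Return x + c*1."""
    return [xi + c for xi in x]

def differences(x):
    """Return (x_1 - x_n, ..., x_{n-1} - x_n), a location invariant function."""
    return [xi - x[-1] for xi in x[:-1]]

x = [1.0, 4.0, 2.5, 7.0]
c = 3.0

# Equivariance: delta(x + c1) = delta(x) + c.
print(statistics.mean(shift(x, c)), statistics.mean(x) + c)
print(statistics.median(shift(x, c)), statistics.median(x) + c)

# Invariance: g(x + c1) = g(x).
print(differences(shift(x, c)), differences(x))
```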

Proposition 6.5. If $\delta$ is location equivariant and $\Theta$ is a location
parameter, then the conditional distribution of $\delta(X) - \Theta$ given $\Theta = \theta$ is the same
for all $\theta$.

Note that Proposition 6.5 implies that the risk function of an equivariant 
estimator is constant if the loss is location invariant. (See Problem 3 on 
page 388.) 

Lemma 6.6.² Suppose that $\delta_0$ is location equivariant. Then $\delta_1$ is location
equivariant if and only if there exists a location invariant function $u$ such
that $\delta_1 = \delta_0 + u$.

Proof. Clearly, if there is such a $u$, then $\delta_0 + u$ is equivariant. Equally
clearly, if $\delta_1$ is equivariant, then $u = \delta_1 - \delta_0$ is invariant. □

Lemma 6.7. A function $u$ is location invariant if and only if $u$ is a function
of $x$ only through $(x_1 - x_n, \ldots, x_{n-1} - x_n)$.

Proof. The "if" part is trivial. For the "only if" part, let $c = -x_n$. Then

$$u(x + c1) = u(x_1 - x_n, \ldots, x_{n-1} - x_n, 0) = u(x),$$

by invariance. Note that when $n = 1$, only constants are invariant. □



2 This lemma is used in the proofs of Theorems 6.8 and 6.18. 






Theorem 6.8. Suppose that $Y = (X_1 - X_n, \ldots, X_{n-1} - X_n)$ and $L(\theta, a) =
(\theta - a)^2$. Suppose that $\delta_0$ is a location equivariant estimator with finite risk.
Then, the equivariant estimator with smallest risk is $\delta_0(X) - E_0[\delta_0(X) \mid Y]$.

Proof. Let $\delta_0$ be an arbitrary equivariant estimator with finite risk. By
Lemma 6.6, all other equivariant estimators have the form $\delta_0(X) - v(Y)$.
Since the risk function is constant for an equivariant $\delta$,

$$R(\theta, \delta) = R(0, \delta) = E_0[\delta_0(X) - v(Y)]^2
= E_0\{E_0[(\delta_0(X) - v(Y))^2 \mid Y]\},$$

which is minimized by minimizing $E_0[(\delta_0(X) - v(Y))^2 \mid Y = y]$ uniformly in
$y$. This is accomplished by choosing $v(y) = E_0[\delta_0(X) \mid Y = y]$. □
The estimator in Theorem 6.8 is due to Pitman (1939). It is often called 
the minimum risk equivariant (MRE) estimator or Pitman's estimator. 
Throughout the rest of this section, we will use the symbol Y to stand 
for the vector defined in Theorem 6.8. 

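Example 6.9. Let $X_1, \ldots, X_n$ be IID $N(\theta, 1)$ given $\Theta = \theta$. Let

$$Y_* = (Y^\top, X_n)^\top = (Y_1, \ldots, Y_{n-1}, Y_n)^\top.$$

Let $\delta_0(X) = X_n = Y_n$. We can write

$$Y_* = \begin{pmatrix}
1 & 0 & \cdots & 0 & -1 \\
0 & 1 & \cdots & 0 & -1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -1 \\
0 & 0 & \cdots & 0 & 1
\end{pmatrix} X.$$

Hence, given $\Theta = \theta$, $X \sim N_n(\theta 1, I_n)$ and

$$Y_* \sim N_n\left(
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ \theta \end{pmatrix},
\begin{pmatrix}
2 & 1 & \cdots & 1 & -1 \\
1 & 2 & \cdots & 1 & -1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & 1 & \cdots & 2 & -1 \\
-1 & -1 & \cdots & -1 & 1
\end{pmatrix}
\right).$$

To get the conditional distribution of $X_n$ given $Y$, we need the inverse of the
upper left-hand corner of the covariance matrix of $Y_*$. This inverse is $I_{n-1} - J/n$,
where $J$ is a matrix of all 1s. So

$$X_n \mid Y = y, \Theta = \theta \sim N\left(\theta + [-1, \ldots, -1]\left(I_{n-1} - \tfrac{1}{n} J\right) y,\; c\right),$$

for some number $c$ which we do not need, since the minimum risk equivariant
estimator depends only on the mean of this distribution when $\theta = 0$. We can
rewrite the mean as

$$-\sum_{i=1}^{n-1} y_i + \frac{n-1}{n} \sum_{i=1}^{n-1} y_i = -\frac{1}{n} \sum_{i=1}^{n-1} y_i = x_n - \bar{x}.$$

Pitman's estimator is then $X_n + \bar{X} - X_n = \bar{X}$, which is not surprising.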






Pitman's estimator is often expressed in a different form. 

Theorem 6.10. In a location problem with one-dimensional parameter,
Pitman's estimator can be written as $E(\Theta \mid X = x)$, where we used a "uniform
prior" for $\Theta$. That is, the MRE estimator for squared-error loss is the
formal Bayes rule with respect to Lebesgue measure, if it has finite risk.

Proof. Suppose that $Y = (X_1 - X_n, \ldots, X_{n-1} - X_n)$ and $f_{X|\Theta}(x|\theta) =
g(x - \theta 1)$. Transform $X$ to $(Y^\top, X_n)^\top$. The Jacobian of this transformation
is 1, and we get

$$f_{Y,X_n|\Theta}(y, x_n|\theta) = g(y + (x_n - \theta)1,\; x_n - \theta).$$

The marginal density of $Y$ is

$$f_{Y|\Theta}(y|\theta) = \int g(y + (x_n - \theta)1,\; x_n - \theta)\, dx_n = \int g(y + u1, u)\, du.$$

This does not depend on $\theta$ because $Y$ is ancillary (see Problem 4 on
page 389). So, we can write the conditional density

$$f_{X_n|Y,\Theta}(x_n|y, 0) = \frac{g(y + x_n 1, x_n)}{\int g(y + u1, u)\, du}.$$

Let $\delta_0(X) = X_n$ in Theorem 6.8. Then

$$v(y) = E_0(X_n \mid Y = y) = \frac{\int u\, g(y + u1, u)\, du}{\int g(y + u1, u)\, du}.$$

Now change variables from $u$ to $z = x_n - u$, so $u = x_n - z$. Then $y_i + u =
x_i - z$, and

$$\delta(x) = x_n - v(y) = \frac{\int (x_n - u)\, g(y + u1, u)\, du}{\int g(y + u1, u)\, du}
= \frac{\int z\, g(x - z1)\, dz}{\int g(x - z1)\, dz}
= \frac{\int \theta f_{X|\Theta}(x|\theta)\, d\theta}{\int f_{X|\Theta}(x|\theta)\, d\theta} = E(\Theta \mid X = x)$$

if the prior for $\Theta$ is Lebesgue measure. □
The "uniform prior," or Lebesgue measure is closely related to location 
equivariance. The relation is that the Lebesgue measure of a set is invariant 
under location shifts. When we deal with more general types of equivari- 
ance, a generalization of Theorem 6.10 will emerge in that the MRE esti- 
mator will be the formal Bayes rule with respect to a "prior" distribution 
that is invariant in the appropriate sense. 
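The formal Bayes representation suggests a direct numerical recipe for Pitman's estimator: integrate $\theta$ against the likelihood and normalize. The sketch below is an illustration only, assuming a midpoint rule on a finite interval and IID $N(\theta, 1)$ data; the function name `pitman_estimate`, the integration limits, and the step count are choices made for the example, not part of the text. For the normal family the result should agree with the sample mean.

```python
import math

def pitman_estimate(x, g, lo, hi, steps=20000):
    """Approximate the ratio of integrals over theta of
    theta * prod g(x_i - theta) and prod g(x_i - theta),
    using a midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    num = 0.0
    den = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        like = math.prod(g(xi - theta) for xi in x)  # likelihood at theta
        num += theta * like
        den += like
    return num / den

def std_normal(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

x = [0.3, -1.2, 2.5, 0.9]
xbar = sum(x) / len(x)

estimate = pitman_estimate(x, std_normal, xbar - 10.0, xbar + 10.0)
print(estimate, xbar)  # the two values should agree closely
```

Replacing `std_normal` with another density $g$ gives the corresponding formal Bayes estimate, provided the integration interval covers the region where the likelihood is non-negligible.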

Example 6.11 (Continuation of Example 6.9; see page 347). Let $X_1, \ldots, X_n$ be
IID $N(\theta, 1)$ given $\Theta = \theta$. If $X = (X_1, \ldots, X_n)$ and $\delta_0(X) = X_n$, the posterior
from a uniform prior is $\Theta \sim N(\bar{x}, 1/n)$, and the MRE is $\bar{X}$, as we saw earlier.






Example 6.12. Let $X$ have Cauchy distribution $Cau(\theta, 1)$ given $\Theta = \theta$. The
posterior from a uniform prior is $\Theta \sim Cau(x, 1)$, and there is no formal Bayes
rule. In fact, it is easy to show that all equivariant estimators have infinite risk.
Note that constant estimators (like $\delta(x) = c \in \Omega$ for all $x$) have finite risk but
are not equivariant (and they have infinite posterior risk, but since the prior is
improper, this is not surprising).

A maximin style theorem can be proven about equivariant estimation. 
Suppose that Nature knows that you will use an equivariant estimator and 
the loss function is squared error. Which location family should Nature 
choose to make your risk as large as possible? The answer is the family of 
normal distributions. 

Theorem 6.13. Suppose that $L(\theta, a) = (\theta - a)^2$ and

$$\mathcal{F} = \left\{ f \ge 0 : \int f(x)\, dx = 1,\ \int x f(x)\, dx = 0,\ \int x^2 f(x)\, dx = 1 \right\}.$$

Suppose that $X_1, \ldots, X_n$ are IID with conditional density $f(x - \theta)$ given
$\Theta = \theta$ for some $f \in \mathcal{F}$. Define $r_n(f)$ to be the greatest lower bound of the
risk over the set of all equivariant estimators. Then $\sup_{f \in \mathcal{F}} r_n(f) = r_n(f_0)$,
where $f_0$ is the standard normal density.

Proof. If $f_0$ is the standard normal density, then $\bar{X}$ is MRE and $r_n(f_0) =
1/n$. Since $\bar{X}$ is always equivariant, it must be that $r_n(f) \le 1/n$, since $1/n$
is the risk of $\bar{X}$ for all $f \in \mathcal{F}$.³ □

Example 6.14. Suppose that $X_1, \ldots, X_n$ are IID with conditional density
$f(x - \theta)$ given $\Theta = \theta$ where $f(x) = \exp[-(x + 1)]$ for $x > -1$. This is a family
of shifted exponential distributions, rigged so that $\theta = 0$ has mean 0 also. The
first-order statistic $X_{(1)} = \min_i X_i$ is equivariant and is a complete sufficient
statistic. Since $Y$ is ancillary, Theorem 2.48 says that $X_{(1)}$ and $Y$ are
independent given $\Theta$. So

$$E_0(X_{(1)} \mid Y) = E_0(X_{(1)}) = \frac{1}{n} - 1,$$

and $X_{(1)} + (n - 1)/n$ is MRE and its risk is $1/n^2$.
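The risk claim in Example 6.14 can be checked by simulation. The following is a rough Monte Carlo sketch, assuming $\theta = 0$ (which suffices because the risk of an equivariant estimator is constant); the function name `simulate_risks`, the sample size, the number of replications, and the seed are illustrative choices. It compares the squared-error risk of the MRE estimator $X_{(1)} + (n-1)/n$ with that of the unadjusted minimum $X_{(1)}$.

```python
import random

def simulate_risks(n, reps, seed=0):
    """Estimate squared-error risk at theta = 0 for the MRE estimator
    X_(1) + (n - 1)/n and for the unadjusted minimum X_(1)."""
    rng = random.Random(seed)
    sq_err_mre = 0.0
    sq_err_min = 0.0
    for _ in range(reps):
        # With theta = 0, X_i = E_i - 1 where E_i is standard exponential.
        x_min = min(rng.expovariate(1.0) - 1.0 for _ in range(n))
        sq_err_min += x_min ** 2
        sq_err_mre += (x_min + (n - 1) / n) ** 2
    return sq_err_mre / reps, sq_err_min / reps

n = 5
risk_mre, risk_min = simulate_risks(n, reps=100_000)
print(risk_mre, 1 / n**2)  # simulated risk of the MRE estimator versus 1/n^2
print(risk_min)            # the unadjusted minimum has larger risk
```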

For more general loss functions, we have the following lemma. 

Lemma 6.15. If $\rho$ is strictly convex and not monotone and $L(\theta, a) =
\rho(a - \theta)$, then the MRE estimator of $\Theta$ exists if and only if there is some
equivariant estimator with finite risk. The MRE is unique. The MRE is
unbiased if $\rho(t) = t^2$.



³There is a theorem of Kagan, Linnik, and Rao (1965) which shows that, for
$n \ge 3$, $E_0(\bar{X} \mid Y) = 0$ if and only if $f = f_0$. This would imply that $r_n(f) < 1/n$ if
$n \ge 3$ and $f \ne f_0$.






Proof.⁴ If no equivariant estimator has finite risk, then it makes no sense
to talk about the MRE estimator. If $\rho$ is strictly convex and $\delta_0$ is equivariant
with finite risk, let $Y = (X_1 - X_n, \ldots, X_{n-1} - X_n)$. Write

$$\phi(t; y) = E_0[\rho(\delta_0(X) - t) \mid Y = y].$$

We first show that $\phi$ is strictly convex as a function of $t$ for fixed $y$:

$$\begin{aligned}
\phi(\alpha t + (1 - \alpha)u; y)
&= E_0[\rho(\alpha\delta_0(X) - \alpha t + (1 - \alpha)\delta_0(X) - (1 - \alpha)u) \mid Y = y] \\
&< E_0[\alpha\rho(\delta_0(X) - t) + (1 - \alpha)\rho(\delta_0(X) - u) \mid Y = y] \\
&= \alpha\phi(t; y) + (1 - \alpha)\phi(u; y),
\end{aligned}$$

where the strict inequality holds because $\delta_0(X) - t$ cannot equal $\delta_0(X) - u$
a.s. $[P_0]$ if $t \ne u$. We can also show that $\phi$ is not monotone in $t$ a.s. This
follows from the fact that since $\rho$ is strictly convex and not monotone,
$\rho(x) \to \infty$ as $x \to \infty$ and as $x \to -\infty$, and $\delta_0(X) - t$ converges in
probability to $-\infty$ or $\infty$ as $t \to \infty$ or $t \to -\infty$, respectively. Since convex
functions are continuous on the interiors of their domains, $\phi(t; y)$ has a
minimum at $t = v(y)$ for each $y$ and the minimum is unique by strict
convexity. It follows that $\delta(X) = \delta_0(X) - v(Y)$ is the MRE estimator,
since it minimizes $E_0(\rho(\delta(X) - 0))$.

If $\rho(t) = t^2$, let $\delta(X)$ be any equivariant estimator:

$$E_\theta \delta(X) = E_0 \delta(X + \theta 1) = \theta + E_0 \delta(X) = \theta + c.$$

The risk of $\delta$ is its variance plus the bias squared, which is minimized by
choosing $c = 0$. Hence the MRE estimator has $c = 0$ and is unbiased. □

6.1.2 Scale Problems* 

It turns out that a scale problem with positive random variables and a
positive parameter is identical to a location problem. All one needs to do
is replace $\theta$ by $\log\theta$ and $X_i$ by $\log X_i$. The general scale problem can be
defined in a fashion similar to the general location problem.

Definition 6.16. Consider a decision problem with parameter space $\mathbb{R}^+$
and action space $\mathbb{R}^+ \cup \{0\}$. A loss function $L(\theta, a)$ is called scale invariant
if $L(\theta, a) = \rho(a/\theta)$ for some function $\rho$. If $\rho$ increases as its argument moves
away from 1, such a decision problem is called scale estimation.

A decision rule $\delta(x)$ is scale equivariant if $\delta(cx) = c\delta(x)$, for all positive $c$
and all $x$. A function $g(x)$ is scale invariant if $g(cx) = g(x)$, for all positive
$c$ and all $x$.



⁴This proof is based on the proof of Theorem 6.8 in Chapter 1 of Lehmann
(1983).

*This section may be skipped without interrupting the flow of ideas.






A scale analog of Theorem 6.2 would say that $f_{X|\Theta}(x|\theta) = g(x/\theta)/\theta^n$. All
of the other results concerning location problems have their counterparts
in the case of scale problems with positive random variables. We will not
restate them all here. It should be noted, however, that squared error must
be changed to $(\log(a/\theta))^2$. Also, if one finds an estimator for $\log\Theta$, one
should remember to exponentiate the result to produce an estimator of $\Theta$.

Example 6.17. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $U(0, \theta)$ given
$\Theta = \theta$. Let $X = (X_1, \ldots, X_n)$. Then $\Theta$ is a scale parameter. Using the loss
$L(\theta, a) = [\log(a/\theta)]^2$, we can find Pitman's estimator of $\log\Theta$ by finding the
posterior mean of $\log\Theta$ based on a uniform prior for $\log\Theta$. This "prior"
translates into the "prior" $1/\theta$ for $\Theta$, since $\psi = \log\theta$ means $d\psi = d\theta/\theta$.
The posterior density for $\Theta$ is then

$$f_{\Theta|X}(\theta|x) = \frac{n x_{(n)}^n}{\theta^{n+1}} I_{[x_{(n)},\infty)}(\theta),$$

where $x_{(n)} = \max_i x_i$. (This is known as a Pareto distribution.) The mean of
$\log\Theta$ is

$$\int_{x_{(n)}}^\infty \frac{n x_{(n)}^n \log\theta}{\theta^{n+1}}\, d\theta
= \log x_{(n)} + \int_0^\infty n t \exp(-nt)\, dt = \log x_{(n)} + \frac{1}{n},$$

by making the transformation $t = \log(\theta/x_{(n)})$. The MRE estimator of $\Theta$ becomes
$X_{(n)} \exp(1/n)$.
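The posterior mean of $\log\Theta$ in Example 6.17 can be checked by sampling from the Pareto posterior. The sketch below assumes inverse-CDF sampling, $\theta = x_{(n)} U^{-1/n}$ with $U$ uniform on $(0, 1]$; the data values, the function name `posterior_mean_log_theta`, the number of draws, and the seed are made up for illustration.

```python
import math
import random

def posterior_mean_log_theta(x_max, n, draws, seed=0):
    """Monte Carlo estimate of E[log(Theta) | x] under the Pareto posterior
    with density n * x_max**n / theta**(n + 1) on [x_max, infinity)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        u = 1.0 - rng.random()             # uniform on (0, 1]
        theta = x_max * u ** (-1.0 / n)    # inverse-CDF draw from the posterior
        total += math.log(theta)
    return total / draws

x = [0.4, 1.7, 0.9, 2.0, 1.1, 0.2, 1.5, 0.8]   # hypothetical sample
n, x_max = len(x), max(x)

mc_mean = posterior_mean_log_theta(x_max, n, draws=200_000)
exact_mean = math.log(x_max) + 1.0 / n
mre_estimate = x_max * math.exp(1.0 / n)       # exponentiate to estimate Theta

print(mc_mean, exact_mean)
print(mre_estimate)
```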

An alternative invariant loss function, which is more like squared-error
loss, is $L(\theta, a) = (\theta - a)^2/\theta^2$. Here, we need not assume that the random
variables are positive, because we will not have to take logarithms. An
analog to Theorem 6.8 is proven next.

Theorem 6.18.⁵ Let $\Theta$ be a scale parameter and let $L(\theta, a) = (\theta - a)^2/\theta^2$
and $Y = (X_1/|X_n|, \ldots, X_n/|X_n|)$. Let $\delta_0$ be an equivariant estimator with
finite risk. Then the equivariant estimator with smallest risk is

$$\delta_0(X)\, \frac{E_1[\delta_0(X) \mid Y]}{E_1[\delta_0^2(X) \mid Y]}.$$

Proof. Let $\delta_0$ be an arbitrary equivariant estimator with finite risk. By the
scale analog to Lemma 6.6, all other equivariant estimators have the form
$\delta_0(X)/v(Y)$, where $v$ is scale invariant. Since the risk function is constant
for an equivariant $\delta$,

$$R(\theta, \delta) = R(1, \delta) = E_1[\delta_0(X)/v(Y) - 1]^2
= E_1\{E_1[(\delta_0(X)/v(Y) - 1)^2 \mid Y]\},$$

which is minimized by minimizing $E_1[(\delta_0(X)/v(y) - 1)^2 \mid Y = y]$ uniformly
in $y$. To do this, choose $v(y) = E_1[\delta_0^2(X) \mid Y = y]/E_1[\delta_0(X) \mid Y = y]$. □

It can be shown that the MRE estimator is also the formal Bayes rule
with respect to the improper prior $1/\theta$.

5 This theorem is used in Example 6.60. 






Theorem 6.19. Under the same conditions as in Theorem 6.18, the MRE
estimator can be written as the formal Bayes rule with respect to a prior
having Radon-Nikodym derivative $1/\theta$ with respect to Lebesgue measure, if
it has finite risk.

Proof. We begin with the equivariant estimator $|X_n|$. Let $Y$ be as in
Theorem 6.18. The transformation from $x$ to $(y^\top, x_n)^\top$ has Jacobian $|x_n|^{n-1}$,
so

$$f_{Y,X_n|\Theta}(y, x_n|1) = f(|x_n| y)\, |x_n|^{n-1},$$

both for $y_n = +1$ and $y_n = -1$. So, the conditional density of $X_n$ given
$Y = y$ is

$$f_{X_n|Y,\Theta}(x_n|y, 1) = \frac{f(|x_n| y)\, |x_n|^{n-1}}{\int f(|u| y)\, |u|^{n-1}\, du}.$$

It follows that

$$v(y) = \frac{\int |u|^2 f(|u| y)\, |u|^{n-1}\, du}{\int |u|\, f(|u| y)\, |u|^{n-1}\, du}
= \frac{\int_0^\infty f(uy)\, u^{n+1}\, du}{\int_0^\infty f(uy)\, u^n\, du},$$

because both integrands are symmetric around 0 as functions of $u$. Now
make the change of variables $u = |x_n|/z$ with inverse $z = |x_n|/u$. Then
$du = -|x_n|\, dz/z^2$. It follows that

$$v(y) = |x_n|\, \frac{\int f(|x_n| y/z)\, z^{-n-3}\, dz}{\int f(|x_n| y/z)\, z^{-n-2}\, dz}
= |x_n|\, \frac{\int f(x/z)\, z^{-n-3}\, dz}{\int f(x/z)\, z^{-n-2}\, dz}.$$

Hence, the MRE estimator is $\delta(X)$, where

$$\delta(x) = \frac{|x_n|}{v(y)} = \frac{\int f(x/z)\, z^{-n-2}\, dz}{\int f(x/z)\, z^{-n-3}\, dz}.$$

To see that this is the formal Bayes rule with respect to the "prior" $1/\theta$, note
that the posterior $f_{\Theta|X}(\theta|x)$ is proportional to $f(x/\theta)/\theta^{n+1}$. The expected
loss is (aside from the proportionality constant)

$$\int \frac{(\theta - a)^2}{\theta^2}\, f(x/\theta)\, \theta^{-n-1}\, d\theta.$$

By expanding this as a function of $a$ and taking the derivative, we find that
the minimum occurs at $a = \delta(x)$. □

The reader should note that the measure $\lambda(A) = \int_A d\theta/\theta$ is invariant
under scale changes. That is, the measure of a set $A$ of positive numbers is
the same as the measure of $|c|A$ for all nonzero real $c$.




6.2 Equivariant Decision Theory 
6.2.1 Groups of Transformations 

Equivariance occurs in more general situations than just location or scale 
problems. For example, it can occur in combined location-scale problems 
with a two-dimensional parameter. In fact, it can occur whenever there is 
a group of transformations that acts on the sample space, the parameter 
space, and the action space in the "same way." We will now make this 
notion more precise. 

Definition 6.20. A group is a nonempty set $G$ together with a binary
operation $\circ$ called composition such that

• for each $g_1, g_2 \in G$, $g_1 \circ g_2 \in G$;

• there exists $e \in G$ such that $e \circ g = g$ for all $g \in G$;

• for each $g \in G$ there is $g^{-1} \in G$ such that $g^{-1} \circ g = e$;

• for each $g_1, g_2, g_3 \in G$, $g_1 \circ (g_2 \circ g_3) = (g_1 \circ g_2) \circ g_3$.

The element $e$ is called the identity and for each $g$, $g^{-1}$ is called the inverse
of $g$. A group is abelian if $\circ$ is commutative, that is, if $g_1 \circ g_2 = g_2 \circ g_1$ for
all $g_1$ and $g_2$.

There appears to be some asymmetry in the definition of the identity 
and inverses. This is illusory, however. 

Lemma 6.21. Let $G$ be a group. For all $g \in G$, $g \circ e = g$ and $g \circ g^{-1} = e$.
There is only one identity element, and for each $g \in G$, there is only one
inverse of $g$. The inverse of $g^{-1}$ is $g$.

Proof. Let $g \in G$. Let $h$ be the inverse of $g^{-1}$. Since $g^{-1} \circ g = e$,
$(g^{-1} \circ g) \circ g^{-1} = e \circ g^{-1} = g^{-1}$. It follows that

$$h \circ ((g^{-1} \circ g) \circ g^{-1}) = h \circ g^{-1} = e.$$

The left-hand side of this last equation can be rewritten using the
associative property as

$$(h \circ g^{-1}) \circ (g \circ g^{-1}) = g \circ g^{-1}.$$

Hence $g \circ g^{-1} = e$. It follows that $g$ satisfies the property required to be
called the inverse of $g^{-1}$. Next, note that $e \circ e = e$ and $g^{-1} \circ g = e$, so

$$g \circ e = g \circ (e \circ e) = g \circ (e \circ (g^{-1} \circ g)) = g \circ ((e \circ g^{-1}) \circ g) = g \circ (g^{-1} \circ g) = g.$$

For the uniqueness claims, first, suppose that $h \circ g = e$. Then, using what
we have just proved,

$$h = h \circ e = h \circ (g \circ g^{-1}) = (h \circ g) \circ g^{-1} = e \circ g^{-1} = g^{-1},$$




which means that the inverse of $g$ is unique. It follows from what we proved
above that $g$ is the unique inverse of $g^{-1}$. Finally, let $h \circ g = g$ for all $g$.
Then

$$h = h \circ (g \circ g^{-1}) = (h \circ g) \circ g^{-1} = g \circ g^{-1} = e.$$

Hence, the identity is unique. □
Sometimes two seemingly different groups are essentially the same. 

Definition 6.22. Let $G_1$ and $G_2$ be groups with compositions $\circ_1$ and $\circ_2$,
respectively. Let $\phi : G_1 \to G_2$ be a one-to-one onto function such that, for
all $g, h \in G_1$, $\phi(g \circ_1 h) = \phi(g) \circ_2 \phi(h)$. Then $\phi$ is called a group isomorphism.

The following proposition is straightforward.

Proposition 6.23. Let $\phi$ be a group isomorphism between $G_1$ and $G_2$.
Then $\phi^{-1}$ is a group isomorphism between $G_2$ and $G_1$. Also, $\phi$ maps the
identity in $G_1$ to the identity in $G_2$. Also, $\phi$ maps the inverse of each $g \in G_1$
to the inverse of $\phi(g) \in G_2$.

The groups that will most interest us are groups of transformations. The
set $U$ in the following definition can be the sample space $\mathcal{X}$ or the parameter
space $\Omega$, or the action space $\aleph$.

Definition 6.24. A measurable function $f : U \to U$ is called a
transformation of $U$. The function $e(u) = u$ for all $u \in U$ is called the identity
transformation.

Proposition 6.25. Suppose that $G$ is a set of transformations of a set
$U$ with $e \in G$ being $e(u) = u$ for all $u \in U$. Let composition $\circ$ be the
composition of functions. If $G$ is a group with identity $e$, then every element
of $G$ is one-to-one.

Example 6.26. Here are some examples of groups of transformations. When we
refer to these examples in the future, we will call the $i$th one "Group $i$."

1. $U = \mathbb{R}^n$, $g_c(x_1, \ldots, x_n) = (x_1 + c, \ldots, x_n + c)$ for each $c \in \mathbb{R}$. The identity
is $g_0$, the inverse of $g_c$ is $g_{-c}$, and the composition is $g_a \circ g_b = g_{a+b}$. This
group is abelian.

2. $U = \mathbb{R}^n$, $g_c(x_1, \ldots, x_n) = (cx_1, \ldots, cx_n)$ for each $c > 0$. The identity is $g_1$,
the inverse of $g_c$ is $g_{1/c}$, and the composition is $g_a \circ g_b = g_{ab}$. This group
is abelian.

3. $U = \mathbb{R}^n$, $g_{(a,b)}(x_1, \ldots, x_n) = (bx_1 + a, \ldots, bx_n + a)$ for each $b > 0$, $a \in \mathbb{R}$.
The identity is $g_{(0,1)}$, the inverse of $g_{(a,b)}$ is $g_{(-a/b,1/b)}$, and the composition
is $g_{(a,b)} \circ g_{(c,d)} = g_{(bc+a,bd)}$. This group is not abelian.

4. $U = \mathbb{R}^n$, $g_A(x) = Ax$, where $A \in GL(n)$, the set of nonsingular $n \times n$
matrices. The identity is $g_I$, the inverse of $g_A$ is $g_{A^{-1}}$, and the composition
is matrix multiplication $g_A \circ g_B = g_{AB}$. This group is not abelian. This is
called the general linear group of dimension $n$.






5. $U = \mathbb{R}^n$, $g_p(x_1, \ldots, x_n) = (x_{p(1)}, \ldots, x_{p(n)})$, where $p$ is any permutation
of $(1, \ldots, n)$. The identity is $g_{(1,\ldots,n)}$, the inverse of $g_p$ is $g_{p^{-1}}$, and the
composition is composition of permutations, $g_{p_1} \circ g_{p_2} = g_{p_1 \circ p_2}$. This group
is not abelian.

6. $U$ can be any measurable set and $G$ can be the set of all one-to-one
measurable functions whose inverses are also measurable.
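The composition and inverse rules stated for Group 3 can be verified mechanically. The sketch below is illustrative only: the class name `Affine` and its methods are hypothetical, and the group element $g_{(a,b)}$ is represented by the pair $(a, b)$ with $b > 0$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Affine:
    """Group 3 element g_(a,b): x -> (b*x_1 + a, ..., b*x_n + a), with b > 0."""
    a: float
    b: float

    def __call__(self, x):
        return [self.b * xi + self.a for xi in x]

    def compose(self, other):
        """Return self o other, using g_(a,b) o g_(c,d) = g_(bc+a, bd)."""
        return Affine(self.b * other.a + self.a, self.b * other.b)

    def inverse(self):
        """Return g_(-a/b, 1/b)."""
        return Affine(-self.a / self.b, 1.0 / self.b)

IDENTITY = Affine(0.0, 1.0)

g = Affine(2.0, 4.0)
h = Affine(-1.0, 0.5)
x = [1.0, 3.0]

print(g.compose(h)(x), g(h(x)))        # composition rule matches function composition
print(g.inverse().compose(g))          # the identity g_(0,1)
print(g.compose(h), h.compose(g))      # different elements: the group is not abelian
```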



If $A \subseteq U$, we will use the shorthand notation $gA$ to denote the set

$$gA = \{u \in U : u = gy \text{ for some } y \in A\}.$$

Definition 6.27. Let $\mathcal{P}_0$ be a parametric family with parameter space $\Omega$
and sample space $(\mathcal{X}, \mathcal{B})$. Let $G$ be a group of transformations of $\mathcal{X}$. We
say that $G$ leaves $\mathcal{P}_0$ invariant if for each $g \in G$ and each $\theta \in \Omega$ there exists
$\theta^* \in \Omega$ such that $P_\theta(A) = P_{\theta^*}(gA)$ for every $A \in \mathcal{B}$.

It is easy to see that the $\theta^*$ in Definition 6.27 is unique.

Lemma 6.28. Suppose that $G$ leaves $\mathcal{P}_0$ invariant. Then, for each $g \in G$
and $\theta \in \Omega$, $\theta^*$ in Definition 6.27 is unique.

Proof. Let $g \in G$ and $\theta \in \Omega$ be given. Suppose that both $\theta^*$ and $\theta'$ satisfy
for every $A \in \mathcal{B}$,

$$P_\theta(A) = P_{\theta^*}(gA) = P_{\theta'}(gA).$$

It follows that for every $A \in \mathcal{B}$, $P_\theta(g^{-1}A) = P_{\theta^*}(A) = P_{\theta'}(A)$. Since
distinct elements of the parameter space have to provide distinct probability
measures, this last equation implies that $\theta^* = \theta'$. □

We will call the unique value $\theta^*$ by the name $\bar{g}\theta$ to indicate its connection
to both $g$ and $\theta$. We can try to understand intuitively what it means to say
that $G$ leaves $\mathcal{P}_0$ invariant. Suppose that we believed that $X$ had
distribution $P_\theta$ given $\Theta = \theta$. We already know what the conditional distribution
of $gX$ is:

$$P'_\theta(gX \in A) = P'_\theta(X \in g^{-1}A).$$

This has nothing to do with equivariance yet. It is a simple consequence
of what we already know about the induced distribution of a function of a
random quantity. What invariance of distributions means is that the second
equation below holds (the others are all consequences of probability theory
and group theory):

$$P'_\theta(X \in g^{-1}A) = P_\theta(g^{-1}A) = P_{\bar{g}\theta}(g g^{-1}A) = P_{\bar{g}\theta}(A) = P'_{\bar{g}\theta}(X \in A).$$

So, we see that the conditional distribution of $gX$ given $\Theta = \theta$ is $P_{\bar{g}\theta}$, which
is the conditional distribution of $X$ given $\Theta = \bar{g}\theta$.

Proposition 6.29. Suppose that $G$ leaves $\mathcal{P}_0$ invariant. Then, for each $g \in
G$, the transformation $\bar{g} : \Omega \to \Omega$ is one-to-one and onto. Also, $\bar{G} = \{\bar{g} : g \in
G\}$ is a group, and $\bar{g}^{-1} = \overline{g^{-1}}$.






The proof of this proposition is straightforward and is good practice for 
those readers who have become rusty in group theory. See the proof of 
Lemma 6.31 below for some guidance. 

Definition 6.30. If $G$ leaves $\mathcal{P}_0$ invariant and $L(\theta, a)$ is a loss function on
$\Omega \times \aleph$, we say that the loss is invariant under $G$ if, for each $g \in G$ and
each $a \in \aleph$, there exists a unique $a^* \in \aleph$ such that $L(\bar{g}\theta, a^*) = L(\theta, a)$ for
all $\theta \in \Omega$. We will denote $a^*$ by $\tilde{g}a$.

Lemma 6.31. If the loss is invariant under $G$, then, for each $g \in G$, the
transformation $\tilde{g} : \aleph \to \aleph$ is one-to-one and onto. Also, $\tilde{G} = \{\tilde{g} : g \in G\}$ is
a group, and $\tilde{g}^{-1} = \widetilde{g^{-1}}$.

Proof. First, we show that $\tilde{g}$ is onto for all $g \in G$. Let $a \in \aleph$ and $g \in
G$. Recall (Proposition 6.29) that $\bar{g}^{-1} = \overline{g^{-1}}$. For all $\theta \in \Omega$, $L(\theta, a) =
L(\overline{g^{-1}}\theta, \widetilde{g^{-1}}a)$. Since this is true for all $\theta$ and $\bar{g}^{-1}$ is one-to-one, it follows
that $L(\psi, \widetilde{g^{-1}}a) = L(\bar{g}\psi, a)$, for all $\psi \in \Omega$ (just let $\psi = \bar{g}^{-1}\theta$). By the
definition of $\tilde{g}$ it follows that $\tilde{g}\,\widetilde{g^{-1}}a = a$, hence $\tilde{g}$ is onto. Applying this
same argument to $g^{-1}$ gives that $\widetilde{g^{-1}}\,\tilde{g}a = a$. This also shows that $\tilde{g}$ is
one-to-one and $\tilde{g}^{-1} = \widetilde{g^{-1}}$. Clearly, the identity transformation on $\aleph$ is $\tilde{e}$. The
composition of $\tilde{g}$ and $\tilde{h}$ is clearly $\widetilde{gh}$, and the associative property follows
directly from the associative property in $G$. □

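Example 6.32. Consider Group 3. Suppose that $X_1, \ldots, X_n$ are conditionally
IID $N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$. Let

$$G = \{(a, b) : b > 0, a \in \mathbb{R}\}.$$

If $g = (a, b)$ and $\theta = (\mu, \sigma)$, then $\bar{g}\theta = (b\mu + a, b\sigma)$ and

$$P_{\bar{g}\theta}(gA) = \int_{gA} \frac{1}{(b\sigma\sqrt{2\pi})^n} \exp\left(-\frac{1}{2b^2\sigma^2} \sum_{i=1}^n (x_i - b\mu - a)^2\right) dx
= \int_A \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (z_i - \mu)^2\right) dz = P_\theta(A).$$

Suppose that the loss function is $L(\theta, d) = (d - \mu)^2/\sigma^2$. Then, if $\tilde{g}d = bd + a$, we
get

$$L(\bar{g}\theta, \tilde{g}d) = \frac{(bd + a - b\mu - a)^2}{b^2\sigma^2} = L(\theta, d).$$

We might ask, "What is the set of all invariant loss functions?" That is, what
is the set of all $L$ such that, for every $\sigma, \mu, d, a, b$,

$$L((\mu, \sigma), d) = L((b\mu + a, b\sigma), bd + a)?$$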






Since this equation must hold for every cr, /x, d, a, 6, it must hold for b = 1/cr and 
a — —\ijo no matter what \i and cr are. It then follows that 



L(( M ,<r),d) = Z,((l,0),^) 



for all d, /x, and a. That is, L((/z, cr), d) = p([d-/z]/cr), for some arbitrary function 
p. It is clear that any L of this form is invariant, so we have found all invariant 
loss functions. 
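
The characterization at the end of Example 6.32 can be checked numerically on a few test points. The following is a minimal sketch, not part of the original development: the function names (`act_on_parameter`, `act_on_action`, `loss_from_rho`, `is_invariant`) and the test points are hypothetical, and a finite check of this kind can refute invariance but cannot prove it.

```python
import math

def act_on_parameter(g, theta):
    # Group 3 on the parameter space: (a, b) sends (mu, sigma) to (b*mu + a, b*sigma).
    a, b = g
    mu, sigma = theta
    return (b * mu + a, b * sigma)

def act_on_action(g, d):
    # Group 3 on the action space: (a, b) sends d to b*d + a.
    a, b = g
    return b * d + a

def loss_from_rho(rho):
    # Build L((mu, sigma), d) = rho((d - mu) / sigma) from an arbitrary function rho.
    def loss(theta, d):
        mu, sigma = theta
        return rho((d - mu) / sigma)
    return loss

def is_invariant(loss, thetas, actions, group_elements):
    # Check L(g theta, g d) == L(theta, d) on a finite set of test points.
    return all(
        math.isclose(loss(act_on_parameter(g, th), act_on_action(g, d)),
                     loss(th, d), rel_tol=1e-9, abs_tol=1e-12)
        for g in group_elements for th in thetas for d in actions
    )

# Hypothetical test points (illustration only).
thetas = [(0.0, 1.0), (-2.0, 0.5), (3.0, 4.0)]
actions = [-1.0, 0.0, 2.5]
group_elements = [(0.0, 1.0), (1.5, 2.0), (-3.0, 0.25)]

scaled_squared = loss_from_rho(lambda t: t * t)   # (d - mu)^2 / sigma^2
scaled_absolute = loss_from_rho(abs)              # |d - mu| / sigma

def unscaled_squared(theta, d):
    # (d - mu)^2 is not a function of (d - mu)/sigma alone.
    return (d - theta[0]) ** 2

print(is_invariant(scaled_squared, thetas, actions, group_elements))
print(is_invariant(scaled_absolute, thetas, actions, group_elements))
print(is_invariant(unscaled_squared, thetas, actions, group_elements))
```

The first two losses have the form $\rho([d - \mu]/\sigma)$ and pass the check; the unscaled squared error changes by the factor $b^2$ under the group action and fails it.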

The method used at the end of Example 6.32 is actually a very general 
method for finding all invariant functions. The first step was to find a 
necessary condition for the function to be invariant. The second step is to 
check that the condition is also sufficient. A similar method works when 
trying to find all equivariant functions. (See Example 6.34 on page 357 for 
an illustration.) 

Definition 6.33. A decision problem is invariant under $G$ if $\mathcal{P}_0$ and the
loss are invariant. In such a case, a nonrandomized decision rule $\delta(x)$ is
equivariant if $\delta(gx) = \tilde{g}\delta(x)$ for all $g \in G$ and all $x \in \mathcal{X}$. A randomized
rule $\delta^*(x)$ is equivariant if $\delta^*(gx)(\tilde{g}A) = \delta^*(x)(A)$ for all $A \in \alpha$, $x \in \mathcal{X}$,
and $g \in G$. A function $v$ is invariant if $v(gx) = v(x)$ for all $x \in \mathcal{X}$ and all
$g \in G$.

We will rarely use randomized equivariant rules, except in discussions of
invariant tests (cf. Section 6.3.3).

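Example 6.34. Consider Group 3. Let $\aleph = \mathbb{R}$ and suppose that we are only
estimating the location parameter. For example, $L(\theta, d) = (\theta_1 - d)^2/\theta_2^2$. Here
$\delta(gx) = \tilde{g}\delta(x)$ means that $\delta(bx_1 + a, \ldots, bx_n + a) = b\delta(x) + a$. Suppose that
$a = -b\bar{x}$ and $b = 1/s$, where $s^2 = \sum_{i=1}^n (x_i - \bar{x})^2/(n - 1)$. Then

$$\delta(x) = \bar{x} + s\,\delta\left(\frac{x_1 - \bar{x}}{s}, \ldots, \frac{x_n - \bar{x}}{s}\right).$$

Since any function of the above form is equivariant, we have found all equivariant
rules. The function $v(x) = \delta([x_1 - \bar{x}]/s, \ldots, [x_n - \bar{x}]/s)$ is invariant and can be thought of as an
element of $\aleph$, while the function $h(x) = (\bar{x}, s)$ is equivariant and can be thought
of as an element of $G$. With this notation, we have written $\delta(x) = h(x)v(x)$.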
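
The representation $\delta(x) = \bar{x} + s\,v(x)$ from Example 6.34 is easy to try out. The sketch below is an illustration under stated assumptions: the helper names (`standardize`, `equivariant_rule`, `third_moment`) and the data are hypothetical, and the particular invariant function $v$ (the average cubed standardized residual) is chosen only to make the rule differ from the sample mean.

```python
import math
import statistics

def standardize(x):
    # Map x to ((x_1 - xbar)/s, ..., (x_n - xbar)/s); this is invariant under Group 3.
    xbar = statistics.mean(x)
    s = statistics.stdev(x)  # uses the n - 1 divisor, as in the text
    return [(xi - xbar) / s for xi in x]

def equivariant_rule(v):
    # Build delta(x) = xbar + s * v(standardized x) from an arbitrary function v.
    def delta(x):
        return statistics.mean(x) + statistics.stdev(x) * v(standardize(x))
    return delta

def third_moment(z):
    # A hypothetical choice of invariant function (illustration only).
    return sum(zi ** 3 for zi in z) / len(z)

delta = equivariant_rule(third_moment)

x = [1.0, 2.5, 4.0, 7.0, 11.0]       # hypothetical data
a, b = 3.0, 2.0                      # group element g = (a, b)
gx = [b * xi + a for xi in x]

print(delta(gx), b * delta(x) + a)   # the two numbers should agree
```

Any other function of the standardized residuals could be substituted for `third_moment` without affecting equivariance, which is the content of the example.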

Example 6.34 suggests a generalization of Lemma 6.6. 

Lemma 6.35. Let $h : \mathcal{X} \to G$ be equivariant. Then $\delta : \mathcal{X} \to \aleph$ is
equivariant if and only if $h^{-1}\delta$ is invariant. (Here $h^{-1}$ means the element of $G$
that is the inverse of $h$, not the inverse of the function $h$, which might not
even exist.)

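Proof. For the "if" part, assume that $h^{-1}\delta = v$ is invariant. Then $v(x) \in \aleph$
and $\delta(x) = h(x)v(x)$. So

$$\delta(gx) = h(gx)v(gx) = gh(x)v(x) = g\delta(x),$$

so $\delta$ is equivariant. For the "only if" part, assume that $\delta$ is equivariant and
let $v = h^{-1}\delta$. Then

$$v(gx) = h(gx)^{-1}\delta(gx) = [gh(x)]^{-1}g\delta(x) = h(x)^{-1}\delta(x) = v(x),$$

so $v$ is invariant. □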



Example 6.36. Consider Group 1. Lemma 6.6 already says that $\delta$ is equivariant
if and only if $-\delta_0 + \delta$ is invariant, where $\delta_0$ is an arbitrary equivariant function.

Consider Group 3. If $\aleph = \mathbb{R}$, Example 6.34 showed how $\delta_0^{-1}\delta$ is invariant,
where $\delta_0(x) = (\bar{x}, s)$. Now suppose that $\aleph = \Omega = G$ and

$$L((\mu,\sigma), d) = \frac{(d_1 - \mu)^2}{\sigma^2} + \left|\log\frac{d_2}{\sigma}\right|.$$

Then $L$ is invariant and $\delta(x)$ is two-dimensional, say $(\delta_1(x), \delta_2(x))$. To say $\delta(gx) = \tilde{g}\delta(x)$
means that, for every $a, b, x$, $\delta(bx_1 + a, \ldots, bx_n + a) = (b\delta_1(x) + a, b\delta_2(x))$.
Now, we have already seen that $\delta_0(x) = (\bar{x}, s)$ is equivariant, so the most general
equivariant estimator is $\delta(x) = \delta_0(x)v(x)$, where $v(x) \in \aleph$ is invariant. That is,
let $v_2(x)$ be an arbitrary positive invariant function and let $v_1(x)$ be an arbitrary
real-valued invariant function. Then $\delta(x) = (sv_1(x) + \bar{x}, sv_2(x))$ is the general
form of an equivariant estimator.

Definition 6.37. An invariant function $v(x)$ is called maximal invariant
if, for every invariant function $u(x)$, $v(x_1) = v(x_2)$ implies $u(x_1) = u(x_2)$
(i.e., $u$ is a function of $v$). For each $x \in \mathcal{X}$, we call

$$O(x) = \{y : y = gx \text{ for some } g \in G\} \qquad (6.38)$$

the orbit of $x$.

It is clear that an invariant function is always constant on orbits. In the
statement of Theorem 6.8, $Y$ is maximal invariant. Also, in the statement
of Theorem 6.18, $Y$ is maximal invariant. In invariant decision problems,
the risk function of an equivariant decision rule is constant on orbits. This
follows trivially from part 4 of the following lemma.

Lemma 6.39.⁶ In the notation of Definition 6.37,

1. $y \in O(x)$ if and only if $x \in O(y)$;

2. orbits are equivalence classes;

3. a maximal invariant assumes distinct values on different orbits;

4. suppose that $m(x,\theta)$ is invariant under the group actions, that is,
$m(gx, \bar{g}\theta) = m(x,\theta)$ for all $x, g, \theta$. Then the distribution of $m(X,\Theta)$
given $\Theta = \theta$ is a function of $\theta$ through the maximal invariant in $\Omega$.

⁶ This lemma is used in the proofs of Lemmas 6.65 and 6.66.



Proof. Parts 1 and 2 are trivial.⁷ For part 3, suppose that $v$ is maximal
invariant. Consider the function $O : \mathcal{X} \to 2^{\mathcal{X}}$ defined in (6.38). Clearly, $O$
is invariant, hence it is a function of $v$. This means that if $O(x) \neq O(y)$,
then $v(x) \neq v(y)$. That is, $v$ assigns different values on different orbits. For
part 4, let $r(\theta) = P'_\theta[m(X,\theta) \in B]$ for an arbitrary set $B$. Then,

$$P'_{\bar{g}\theta}[m(X, \bar{g}\theta) \in B] = P'_\theta[m(gX, \bar{g}\theta) \in B] = P'_\theta[m(X,\theta) \in B],$$

by invariance of $m$. So, $r(\bar{g}\theta) = r(\theta)$, and $r$ is invariant, hence it is a function
of the maximal invariant in $\Omega$. □

Corollary 6.40. In an invariant decision problem, the risk function of an 
equivariant decision rule is constant on orbits in the parameter space. 

This corollary gives a plausible justification for restricting attention to 
equivariant decision rules. Since the risk function is constant on orbits in 
the parameter space when the loss is invariant, this makes it easier to 
compare equivariant rules by means of their risk functions. In particular, if 
the group acts transitively on the parameter space (i.e., there is only one 
orbit), then the problem of noncomparability of risk functions disappears
altogether. 
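
Corollary 6.40 can be illustrated by simulation. The following is a minimal sketch under stated assumptions, not a result from the text: it uses Group 3 with normal data, the invariant loss $(d - \mu)^2/\sigma^2$, common standard normal draws for every parameter value, and two hypothetical rules (`sample_mean`, which is equivariant, and `shrunken_mean`, which is not). Since Group 3 acts transitively on the parameter space here, the equivariant rule should have the same estimated risk at every $\theta$.

```python
import math
import random

def simulate_risk(rule, theta, z_samples):
    # Monte Carlo risk under the invariant loss (d - mu)^2 / sigma^2,
    # generating each sample as x = mu + sigma * z from shared standard normal draws z.
    mu, sigma = theta
    total = 0.0
    for z in z_samples:
        x = [mu + sigma * zi for zi in z]
        d = rule(x)
        total += (d - mu) ** 2 / sigma ** 2
    return total / len(z_samples)

def sample_mean(x):
    # Equivariant under Group 3.
    return sum(x) / len(x)

def shrunken_mean(x):
    # Not equivariant: shrinks toward the fixed point 0.
    return 0.5 * sum(x) / len(x)

rng = random.Random(0)
n = 5
z_samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(2000)]

thetas = [(0.0, 1.0), (5.0, 3.0), (-40.0, 0.1)]  # hypothetical parameter values
equivariant_risks = [simulate_risk(sample_mean, th, z_samples) for th in thetas]
shrunken_risks = [simulate_risk(shrunken_mean, th, z_samples) for th in thetas]

print(equivariant_risks)  # essentially identical across theta
print(shrunken_risks)     # varies with theta
```

Using the same draws `z_samples` for every $\theta$ makes the comparison exact up to rounding, so the constancy of the risk is visible without appealing to Monte Carlo error bounds.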

6.2.2 Equivariance and Changes of Units 

One popular justification for the use of equivariant rules is that the result 
one obtains should not depend on the units in which the variables and 
parameters are measured. For example, suppose that we are estimating 
a length in feet and our measurements are in feet. If we were then to 
change our measurements (and the length) to inches, our estimate should 
be 12 times as large. This sounds like a requirement that the estimator be 
scale equivariant. Similarly, if we are estimating a temperature in °C and 
we convert all measurements to °K, then we should just add 273 to the 
°C estimate to get the °K estimate. This sounds like a requirement that 
the estimate be location equivariant. However, neither of these examples 
has anything to do with equivariance. First, we show that estimators need 
not be equivariant in order for the units of measurement to be converted 
correctly. Afterwards, we show why changes of measurement scale have 
nothing to do with equivariance. 

Example 6.41. Suppose that $\Theta$ is tomorrow's temperature in °C, and suppose
that I have a prior distribution for $\Theta$ which is $N(-12, 100)$.⁸ I will observe
$X \sim N(\theta, 25)$ (in °C) given $\Theta = \theta$, and the loss is $L(\theta, a) = (\theta - a)^2$. Ignoring absolute
0, this problem is a location problem. The only location equivariant rules are
$\delta_c(x) = x + c$. The Bayes rule (in °C) is

$$E(\Theta|X = x) = \frac{\frac{x}{25} - \frac{12}{100}}{\frac{1}{25} + \frac{1}{100}} = \frac{4x - 12}{5}.$$

This is clearly not equivariant. Does that mean that it will violate the rule of
changing units? Of course not!

Suppose that we change all units to °K. The new parameter is $\Theta^*$ equal to
tomorrow's temperature in °K and $\Theta^* = 273 + \Theta$, so the prior for $\Theta^*$ (in °K) is
$N(261, 100)$. The datum we will observe is $X^* = X + 273 \sim N(\theta^*, 25)$ (in °K)
given $\Theta^* = \theta^*$. The loss function is still assumed to be $L^*(\theta^*, a^*) = (\theta^* - a^*)^2$.
Everything is now ready for finding the Bayes rule, which is the posterior mean
of $\Theta^*$ (not the mean of $\Theta$, which would clearly be too small by 273),

$$E(\Theta^*|X^* = x^*) = \frac{4x^* + 261}{5} = \frac{4x - 12}{5} + 273,$$

which is just what we would like.

⁷ See Problem 15 on page 139 for the definition of equivalence class.
⁸ This example was written during the winter.

Notice that, in Example 6.41, we have treated $\Theta$, $X$, $\Theta^*$, and $X^*$ as
pure numbers, but we were careful to say what the numbers stood for. A 
great deal of the confusion about changing units is caused by ignoring this 
simple, but vital, procedure. We now turn to a careful discussion of this 
point. 
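
The arithmetic of Example 6.41 can be reproduced in a few lines. This is a minimal sketch, with hypothetical function names (`posterior_mean`, `bayes_celsius`, `bayes_kelvin`); it assumes only the standard precision-weighted formula for the posterior mean of a normal mean with a normal prior and known data variance.

```python
import math

def posterior_mean(x, prior_mean, prior_var, data_var):
    # Posterior mean for a normal prior and one normal observation with known variance.
    w_data = 1.0 / data_var
    w_prior = 1.0 / prior_var
    return (w_data * x + w_prior * prior_mean) / (w_data + w_prior)

def bayes_celsius(x):
    # Prior N(-12, 100), datum N(theta, 25), everything in degrees C.
    return posterior_mean(x, prior_mean=-12.0, prior_var=100.0, data_var=25.0)

def bayes_kelvin(x_star):
    # Reparameterized problem: prior N(261, 100), datum N(theta*, 25), everything in degrees K.
    return posterior_mean(x_star, prior_mean=261.0, prior_var=100.0, data_var=25.0)

x = -5.0  # a hypothetical observation in degrees C
print(bayes_celsius(x))                  # (4x - 12)/5
print(bayes_kelvin(x + 273.0) - 273.0)   # the Kelvin answer, converted back

# The rule is not location equivariant: shifting the datum by 10 does not shift the estimate by 10.
print(bayes_celsius(x + 10.0) - (bayes_celsius(x) + 10.0))
```

The two problems give estimates that differ by exactly 273, even though the Bayes rule fails the equivariance identity $\delta(x + c) = \delta(x) + c$.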

Changing units of measurement has nothing to do with equivariance. It 
is simply a reparameterization. When you reparameterize the problem, you 
must reparameterize the loss function, the prior, and the likelihood. Who 
would ever dream of using the same proper prior for °K as for °C? The 
same applies to any change of units. Surprisingly, the same applies even to 
more general transformations that do not correspond to changes of units. 

Example 6.42. Let $X \sim U(0,\theta)$ given $\Theta = \theta$. Suppose that the loss function is
$L(\theta,a) = (\theta - a)^2$ and the prior is $f_\Theta(\theta) = \theta^{-2}I_{[1,\infty)}(\theta)$. Then the posterior is

$$f_{\Theta|X}(\theta|x) = \frac{2c^2}{\theta^3}I_{[c,\infty)}(\theta),$$

where $c = \max\{1,x\}$. The Bayes rule is the posterior mean $E(\Theta|X = x) = 2c$.

Suppose that we reparameterize to $\Theta^* = \Theta^2$. If there is going to be a connection
between the two decision problems, the loss function had better transform in such
a way that we are essentially estimating the same thing. That is, the loss had
better be $L^*(\theta^*,a^*) = (\sqrt{\theta^*} - \sqrt{a^*})^2$. The prior for $\Theta^*$ is

$$f_{\Theta^*}(\theta^*) = \frac{1}{2(\theta^*)^{3/2}}I_{[1,\infty)}(\theta^*),$$

and the conditional distribution of $X$ given $\Theta^* = \theta^*$ is $U(0, \sqrt{\theta^*})$. As a red
herring, we could also transform the data to $X^* = X^2$ and then

$$f_{X^*|\Theta^*}(x^*|\theta^*) = \frac{1}{2\sqrt{x^*\theta^*}}I_{[0,\theta^*]}(x^*).$$

(This makes the situation look more like the equivariance setup.) Now, we can
find the posterior of $\Theta^*$,

$$f_{\Theta^*|X^*}(\theta^*|x^*) = \frac{c^*}{(\theta^*)^2}I_{[c^*,\infty)}(\theta^*),$$

where $c^* = \max\{x^*, 1\} = c^2$. The Bayes rule for the new loss function is not the
posterior mean, rather it is the $a^*$ that minimizes

$$\int_{c^*}^\infty \left(\sqrt{\theta^*} - \sqrt{a^*}\right)^2 \frac{c^*}{(\theta^*)^2}\,d\theta^*.$$

By expanding the square and differentiating with respect to $a^*$, we find that the
posterior expected loss is minimized if $a^* = 4c^*$, which is precisely the square of
the Bayes rule $2c$ in the original problem.
In each of the above examples, the Bayes rule in the reparameterized 
problem is the reparameterization of the original Bayes rule. This is actually 
true in general. 

Proposition 6.43. Suppose that we reparameterize to $\Theta' = g(\Theta)$ where
$g : \Omega \to \Omega'$ is bimeasurable and $\aleph' = \Omega'$. If the loss function changes
from $L(\theta, a)$ to $L'(\theta',a') = L(g^{-1}(\theta'), g^{-1}(a'))$ and $\delta(x)$ is the formal
Bayes rule for prior $f_\Theta(\theta)$ (based on data $X = x$ with conditional density
$f_{X|\Theta}(x|\theta)$), then the formal Bayes rule in the reparameterized problem
is $\delta'(x) = g(\delta(x))$.

A note about transformations and loss functions is in order here. In
Proposition 6.43, we transformed the loss function $L$ to $L'$. In the usual
equivariance setup, the loss function $L$ is assumed to be invariant. That is,
$L(g^{-1}(\theta'), g^{-1}(a')) = L(\theta',a')$, as was the case in Example 6.41. But this
was a mere coincidence and had nothing to do with changing units. Even
if the loss function had not been location invariant in Example 6.41, the
Bayes rule would have respected the change of units, so long as the correct
loss function were used after the change of units. Consider the following
modification of Example 6.41.

Example 6.44 (Continuation of Example 6.41; see page 359). Suppose that the
loss function is

$$L(\theta,a) = \begin{cases} (\theta - a)^2 & \text{if } \theta \geq 0, \\ 2(\theta - a)^2 & \text{if } \theta < 0. \end{cases}$$

This says that an error of a certain magnitude is twice as costly if the true
temperature is below freezing than if it is freezing or above. If we try to use the
same loss function in °K, then no such distinction is made in the costs of errors
of the same magnitude. It is ludicrous to claim that these two decision problems
are essentially the same problem with different units. Of course, the loss $L$ here
is not invariant. The transformed loss

$$L'(\theta',a') = \begin{cases} (\theta' - a')^2 & \text{if } \theta' \geq 273, \\ 2(\theta' - a')^2 & \text{if } \theta' < 273 \end{cases}$$

is the appropriate one to use in the °K scale.

Note that no transformation of the data is needed in Proposition 6.43. 
If one wishes to parallel the equivariance situation more completely, one 
can feel free to transform the data also, but it makes no difference to the 
conclusion of the proposition, so long as the transformation is bimeasurable, 
since the posterior distribution will be unchanged. The conclusion here is 
that there is no need for decision rules to be equivariant in order to obey 
conversion of units. Bayesian decision rules obey conversion of units without 
being equivariant. 
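
The conclusion of Proposition 6.43 can be seen in a small discretized version of Example 6.44. The following is a sketch under stated assumptions, not the setting of the proposition itself: the parameter and action spaces are replaced by a hypothetical integer grid of temperatures, the prior weights are taken proportional to the $N(-12, 100)$ density on that grid, and the names (`bayes_action`, `loss_c`, `loss_k`, and so on) are illustrative only.

```python
import math

def normal_density(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bayes_action(x, params, prior, likelihood, loss, actions):
    # Unnormalized posterior weights; the normalizing constant does not affect the minimizer.
    weights = [p * likelihood(x, th) for th, p in zip(params, prior)]
    def expected_loss(a):
        return sum(w * loss(th, a) for th, w in zip(params, weights))
    return min(actions, key=expected_loss)

def loss_c(theta, a):
    # Asymmetric loss of Example 6.44 in degrees C.
    return (theta - a) ** 2 if theta >= 0 else 2 * (theta - a) ** 2

def to_kelvin(t):
    return t + 273

def to_celsius(t):
    return t - 273

def loss_k(theta_k, a_k):
    # Transformed loss L'(theta', a') = L(g^{-1}(theta'), g^{-1}(a')).
    return loss_c(to_celsius(theta_k), to_celsius(a_k))

def likelihood(x, theta):
    # One observation with variance 25 in either scale.
    return normal_density(x, theta, 25.0)

params_c = list(range(-40, 21))                        # hypothetical grid, degrees C
prior = [normal_density(th, -12.0, 100.0) for th in params_c]
params_k = [to_kelvin(th) for th in params_c]          # the same grid in degrees K

x_c = -3.0                                             # hypothetical observation
delta_c = bayes_action(x_c, params_c, prior, likelihood, loss_c, params_c)
delta_k = bayes_action(to_kelvin(x_c), params_k, prior, likelihood, loss_k, params_k)

print(delta_c, delta_k, to_kelvin(delta_c))
```

The rule computed in °K is the conversion of the rule computed in °C, although the loss is not invariant and the rule is not equivariant.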

Finally, we consider the root of the misconception that equivariant estimators
are required in order to obey conversion of units.⁹ Imagine that I
will measure the length of a table with an inaccurate device. Let the measurement
I hope to get in feet be denoted $X$, and let the sample space be
$\mathcal{X} = \mathbb{R}^+$. That is, $\mathcal{X}$ is the set of possible measurements, in feet, which I
might obtain. Let $\Theta$ denote the "true" length of the table (whatever that
means) in feet, and suppose that the conditional distribution of $X$ given
$\Theta = \theta$ is denoted $P_\theta$, where $\theta$ can be any positive number. That is, $\Omega = \mathbb{R}^+$
is the set of possible values of the "true" length of the table in feet.
(Obviously some values of $\theta$ are far less likely than others as candidates for $\Theta$,
but we will ignore the Bayesian aspects of this problem for the time being.)
If we convert our observed measurement to inches, we get $X' = 12X$.
This situation resembles Group 2, scalar multiplication. Suppose that the
parametric family $\mathcal{P}_0 = \{P_\theta : \theta \in \Omega\}$ is invariant under Group 2. We could
mistakenly think of $X'$ as $g_{12}X$ and then we could construct $\bar{g}_{12}\theta = 12\theta$.
As we noted earlier, it is now perfectly correct to say that the distribution
of $12X$ is $P_{12\theta}$. But here is where things get confused. The transformed $\theta$,
namely $\bar{g}_{12}\theta$, is supposed to be an element of $\Omega$, which consists of the possible
values of the "true" length of the table in feet, not inches! Although it
is perfectly permissible to think of $12\theta$ as the number of inches representing
the true length of the table in inches, it is absolutely forbidden to think of
$\bar{g}_{12}\theta$ as anything other than a possible length of the table in feet. In like
manner, the sample space $\mathcal{X}$ is the set of possible measurements in feet, not
inches. The transformed measurement $g_{12}X$ is 12 times as many feet as $X$,
not the number of inches in $X$ feet. Otherwise, how would we ever distinguish
whether the number $12 \in \mathcal{X}$ stood for 12 feet or 1 foot converted to
12 inches? It cannot be both ways. We made it perfectly clear that $x \in \mathcal{X}$
stands for $x$ feet and hence $12x \in \mathcal{X}$ stands for $12x$ feet, not $x$ feet converted
to inches. Hence $g_{12}X$ is not the converted measurement of the table
to inches, but rather a measurement of the table in feet, which is 12 times
as large as the measurement $X$. There is no other mathematically correct
way to interpret these transformations. Hence, the invariance of the distributions
has absolutely nothing to do with conversion of units from feet
to inches. That conversion is handled in a straightforward manner as a
reparameterization, the way we did in Example 6.41 and Proposition 6.43.

⁹ The author is indebted to Morris H. DeGroot for personal discussions about
equivariance and invariance which helped immensely in clarifying these concepts.



6.2.3 Minimum Risk Equivariant Decisions

In location estimation with squared-error loss, Pitman's estimator was
MRE and it was the generalized Bayes rule with respect to a uniform
"prior" distribution. The feature of that prior distribution which made it
the one to use in location estimation was the fact that it is invariant with
respect to location shifts. Similarly, in scale estimation we discovered that
the MRE, with a particular invariant loss, was also the Bayes rule with
respect to an improper prior $\mu_\Theta$ with Radon-Nikodym derivative $1/\theta$ with
respect to Lebesgue measure. This measure is invariant with respect to
scale changes. The pattern emerging here can be extended to more general
groups, once we know how to find invariant measures.

Definition 6.45. Let $G$ be a group with a $\sigma$-field of subsets $\Gamma$. Suppose
that for each $g \in G$ and $A \in \Gamma$, $gA \in \Gamma$ and $A^{-1} \in \Gamma$. A measure $\lambda$ on $\Gamma$
is called left Haar measure LHM (or left invariant measure) if, for every
$g \in G$ and every $A \in \Gamma$, $\lambda(gA) = \lambda(A)$. Similarly, $\rho$ is called right Haar
measure RHM (or right invariant measure) if $\rho(Ag) = \rho(A)$ for every $A \in \Gamma$
and every $g \in G$.

It should be noted that every positive multiple of an invariant measure 
is another invariant measure, so we may introduce an arbitrary constant 
multiple if we wish. It should also be easy to see that if G is abelian, then 
RHM=LHM. 

Example 6.46. Group 1 is abelian, so LHM=RHM. Note that $\int_A dx = \int_{A+c} dx$,
for all measurable $A$ and all real $c$, so Lebesgue measure is invariant.

Group 2 is abelian, so LHM=RHM. Note that

$$\int_{cA} \frac{1}{x}\,dx = \int_A \frac{c}{cy}\,dy = \int_A \frac{1}{y}\,dy,$$

so the measure with Radon-Nikodym derivative $1/x$ with respect to Lebesgue
measure is invariant.

Group 3 is not abelian, so we need to find LHM and RHM separately. The
group action is $(a,b) \circ (c,d) = (bc + a, bd)$. Suppose that $h(x,y)$ is a Radon-Nikodym
derivative for LHM with respect to Lebesgue measure. Then

$$\int_A h(x,y)\,dx\,dy = \int_{(a,b)A} h(x,y)\,dx\,dy.$$

Transform the left-hand side by $y = w/b$ and $x = (z - a)/b$. The Jacobian is
$J = 1/b^2$, and we get

$$\int_A h(x,y)\,dx\,dy = \int_{(a,b)A} \frac{1}{b^2}\,h\left(\frac{z - a}{b}, \frac{w}{b}\right) dz\,dw = \int_{(a,b)A} h(z,w)\,dz\,dw.$$

Since this must hold for all $a, b, A$, we have

$$\frac{1}{b^2}\,h\left(\frac{z - a}{b}, \frac{w}{b}\right) = h(z,w)$$

for all $a, b$ and a.e. $[dz\,dw]$. So, if $a = z$ and $b = w$, we must have $h(z,w) = h(0,1)/w^2$.
It follows that LHM has Radon-Nikodym derivative $1/y^2$ with respect
to Lebesgue measure. Next, suppose that $h(x,y)$ is the Radon-Nikodym
derivative for RHM. Then

$$\int_A h(x,y)\,dx\,dy = \int_{A(c,d)} h(x,y)\,dx\,dy.$$

Transform the left-hand side by $y = w/d$ and $x = z - wc/d$. Then the Jacobian
is $J = 1/d$, and we get

$$\int_A h(x,y)\,dx\,dy = \int_{A(c,d)} \frac{1}{d}\,h\left(z - \frac{wc}{d}, \frac{w}{d}\right) dz\,dw = \int_{A(c,d)} h(z,w)\,dz\,dw.$$

Since this must hold for all $c, d, A$, we have

$$\frac{1}{d}\,h\left(z - \frac{wc}{d}, \frac{w}{d}\right) = h(z,w)$$

for all $c, d$ and a.e. $[dz\,dw]$. So, if $c = z$ and $d = w$, we must have $h(z,w) = h(0,1)/w$.
It follows that RHM has Radon-Nikodym derivative $1/y$ with respect
to Lebesgue measure. This is not the same as LHM.
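
The Group 3 calculations in Example 6.46 can be checked numerically. The sketch below is an illustration under stated assumptions: it approximates the measure of a translated rectangle by the change-of-variables formula and a midpoint rule, and the names (`measure_of_image`, `left_translate`, `right_translate`), the rectangle `A`, and the group element `g` are hypothetical. The translations are affine, so their Jacobians are the constants $b^2$ and $d$.

```python
import math

def measure_of_image(density, phi, jacobian, rect, n=100):
    # Approximate the measure of phi(A) for the rectangle A = [x0, x1] x [y0, y1]:
    # integral over A of density(phi(x, y)) * |J| dx dy, by the midpoint rule.
    x0, x1, y0, y1 = rect
    dx = (x1 - x0) / n
    dy = (y1 - y0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * dx
        for j in range(n):
            y = y0 + (j + 0.5) * dy
            u, w = phi(x, y)
            total += density(u, w) * jacobian
    return total * dx * dy

def left_translate(g):
    # (a, b) o (x, y) = (b*x + a, b*y); the Jacobian determinant is b^2.
    a, b = g
    return (lambda x, y: (b * x + a, b * y)), b * b

def right_translate(g):
    # (x, y) o (c, d) = (y*c + x, y*d); the Jacobian determinant is d.
    c, d = g
    return (lambda x, y: (y * c + x, y * d)), d

def identity():
    return (lambda x, y: (x, y)), 1.0

def lhm_density(x, y):
    return 1.0 / (y * y)   # candidate left Haar density

def rhm_density(x, y):
    return 1.0 / y         # candidate right Haar density

A = (0.0, 2.0, 1.0, 3.0)   # hypothetical rectangle with y > 0
g = (1.5, 4.0)             # hypothetical group element

lam_A = measure_of_image(lhm_density, *identity(), A)
lam_gA = measure_of_image(lhm_density, *left_translate(g), A)
lam_Ag = measure_of_image(lhm_density, *right_translate(g), A)
rho_A = measure_of_image(rhm_density, *identity(), A)
rho_Ag = measure_of_image(rhm_density, *right_translate(g), A)
rho_gA = measure_of_image(rhm_density, *left_translate(g), A)

print(lam_A, lam_gA)       # left invariance of 1/y^2
print(rho_A, rho_Ag)       # right invariance of 1/y
print(rho_gA / rho_A)      # 1/y is not left invariant: the ratio is the scale component of g
print(lam_Ag / lam_A)      # 1/y^2 is not right invariant
```

The last two ratios show that each measure is only rescaled, not preserved, by translation on the other side, which is why the two Haar measures differ for this nonabelian group.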

Not only are measures of sets invariant under group operations when 
using LHM, but certain integrals are also invariant. 

Lemma 6.47.¹⁰ If $\lambda$ is LHM on $G$ and $f$ is integrable over $G$, then for all
$g \in G$,

$$\int_G f(g \circ h)\,d\lambda(h) = \int_G f(h)\,d\lambda(h).$$

Proof. First let $f(h) = I_B(h)$ for some $B \in \Gamma$. Then

$$\int_G f(g \circ h)\,d\lambda(h) = \int_G I_B(g \circ h)\,d\lambda(h) = \int_G I_{g^{-1}B}(h)\,d\lambda(h)
= \lambda(g^{-1}B) = \lambda(B) = \int_G f(h)\,d\lambda(h).$$

By adding, we can extend to simple functions $f$. By the monotone convergence
theorem A.52, we can extend to all nonnegative measurable functions,
and by subtraction to all integrable functions. □

For a detailed discussion of Haar measure, see Nachbin (1965) or Halmos
(1950, Chapter XI). For example, there are results giving conditions under
which LHM exists. Since we will only use Haar measure explicitly when it
does exist, we will not prove its existence. However, we will need to know
that Haar measure is essentially unique.

¹⁰ This lemma is used in the proofs of Lemmas 6.55 and 6.62.



Lemma 6.48.¹¹ Let $(G, \circ)$ be a group, and let $(G, \Gamma)$ be a topological space
with the Borel $\sigma$-field. Suppose that $\lambda$ is $\sigma$-finite and not identically 0 LHM
on $(G, \Gamma)$. Suppose that the function $f : G \times G \to G$ defined by $f(g,h) = g^{-1} \circ h$
is continuous. If $\lambda'$ is also $\sigma$-finite and not identically 0 LHM on
$(G, \Gamma)$, then there exists a finite positive scalar $c$ such that $\lambda' = c\lambda$.

Proof. The first step is to prove that $r(g) = I_{B^{-1}}(g)/\lambda(Ag)$ is a measurable
function of $g$ for each $A, B \in \Gamma$. Since $f(g,h)$ is continuous, it is
continuous in $g$ for fixed $h$, hence $f^*(g) = f(g,e) = g^{-1}$ is continuous,
hence measurable. If $B \in \Gamma$, $f^{*-1}(B) = B^{-1}$, hence $B^{-1} \in \Gamma$. It follows
that $I_{B^{-1}}(g)$ is measurable. The function $f'(g,h) = h \circ g = f(f^*(h), g)$ is
also continuous, hence measurable. It follows that $v(g,h) = (g^{-1}, h \circ g)$ is
continuous and measurable. It is easy to see that $v^{-1} = v$, so if $A \in \Gamma$,
$G \times A \in \Gamma \otimes \Gamma$, hence $v(G \times A) \in \Gamma \otimes \Gamma$ and $m(g,h) = I_{v(G \times A)}(g^{-1}, h)$ is a
measurable function. Define $t(g) = \int m(g,h)\,d\lambda(h)$. By Lemma A.67, this
is a measurable function. Now notice that $I_{v(G \times A)}(g^{-1}, h) = I_{Ag}(h)$ and
calculate

$$t(g) = \int m(g,h)\,d\lambda(h) = \int I_{Ag}(h)\,d\lambda(h) = \lambda(Ag).$$

It follows that $r$ is measurable.

Next, we prove that the following two one-to-one bicontinuous functions
preserve measure in the product space $(G \times G, \Gamma \otimes \Gamma, \lambda' \times \lambda)$:

$$T_1(g,h) = (g, g \circ h), \qquad T_2(g,h) = (h \circ g, h).$$

The proofs are similar, and we only prove that $T_2$ preserves measure. Note
that $E \in \Gamma \otimes \Gamma$ implies that, for every $h \in G$,

$$\{g : (g,h) \in T_2(E)\} = h\{g : (g,h) \in E\} = hE^h,$$

where $E^h = \{g : (g,h) \in E\}$. It follows from Tonelli's theorem A.69 that

$$\lambda' \times \lambda(T_2(E)) = \int I_{T_2(E)}(g,h)\,d\lambda' \times \lambda(g,h) = \int \lambda'(hE^h)\,d\lambda(h)
= \int \lambda'(E^h)\,d\lambda(h) = \lambda' \times \lambda(E).$$

So $T_2$ preserves measure. Also, $T_1^{-1}T_2(g,h) = (h \circ g, g^{-1})$ preserves measure.
Hence, for every nonnegative measurable function $v : G \times G \to \mathbb{R}$,

$$\int v(g,h)\,d\lambda' \times \lambda(g,h) = \int v(h \circ g, g^{-1})\,d\lambda' \times \lambda(g,h). \qquad (6.49)$$

This is proven by noting that it is true for indicators of events, hence
for simple functions, hence for nonnegative measurable functions by the
monotone convergence theorem A.52.

Let $A \in \Gamma$ have $0 < \lambda(A) < \infty$ and let $B \in \Gamma$. Define

$$r(g) = \frac{I_{B^{-1}}(g)}{\lambda(Ag)} = \frac{I_B(g^{-1})}{\lambda(Ag)},$$

which we have already shown to be measurable. We prove next that

$$\lambda'(B) = \lambda'(A) \int r(h)\,d\lambda(h). \qquad (6.50)$$

Use Tonelli's theorem A.69 and (6.49) to write

$$\lambda'(A) \int r(h)\,d\lambda(h) = \int I_A(g)r(h)\,d\lambda' \times \lambda(g,h) = \int I_A(h \circ g)r(g^{-1})\,d\lambda' \times \lambda(g,h)
= \int \lambda(Ag^{-1})r(g^{-1})\,d\lambda'(g) = \int I_B(g)\,d\lambda'(g) = \lambda'(B),$$

where the second to last equation follows from $r(g^{-1})\lambda(Ag^{-1}) = I_B(g)$.
Next, apply (6.50) with $\lambda' = \lambda$ to get

$$\lambda(B) = \lambda(A) \int r(h)\,d\lambda(h).$$

Multiply both sides of this equation by $\lambda'(A)$ and apply (6.50) again to
get $\lambda'(A)\lambda(B) = \lambda(A)\lambda'(B)$. Let $c = \lambda'(A)/\lambda(A)$. Since (6.50) is true for
all $B$, it is true if $0 < \lambda'(B) < \infty$. It follows that $0 < \lambda'(A) < \infty$, hence
$0 < c < \infty$, and the proof is complete. □

¹¹ This lemma is used to prove Corollary 6.52. The proof is adapted from Halmos (1950).

For the rest of this text, whenever LHM or RHM and groups are discussed,
we will assume that the group satisfies the conditions of Lemma 6.48 and
that the measures are $\sigma$-finite and not identically 0.

Lemma 6.51.¹² If $\lambda$ is LHM on a group $G$, then $\rho(A) = \lambda(A^{-1})$ is RHM,
and we call $\rho$ the RHM related to $\lambda$.

Proof. Note that $\rho(Ag) = \lambda(g^{-1}A^{-1}) = \lambda(A^{-1}) = \rho(A)$. □

The following corollary to Lemmas 6.48 and 6.51 now follows easily.

Corollary 6.52. Assume the conditions of Lemma 6.48. If $\rho$ and $\rho'$ are
both $\sigma$-finite and not identically 0 RHM on $(G, \Gamma)$, then there exists a finite
positive scalar $c'$ such that $\rho' = c'\rho$.

¹² This lemma is used to prove Corollary 6.52 and Lemma 6.54.

The following result, whose proof is based on the same concept as the
proof of Lemma 6.47, is useful for converting between integrals with respect
to LHM and RHM.

Proposition 6.53.¹³ If $\lambda$ is LHM, $\rho$ is the related RHM, and $f$ is integrable
with respect to $\rho$, then $\int f(g)\,d\rho(g) = \int f(g^{-1})\,d\lambda(g)$. If $f$ is integrable
with respect to $\lambda$, then $\int f(g)\,d\lambda(g) = \int f(g^{-1})\,d\rho(g)$.

The following result gives a method for converting one LHM or RHM 
into many others. 

Lemma 6.54.¹⁴ Let $\rho_g(B)$ be defined as $\rho(gB)$, and let $\lambda_g(B) = \lambda(Bg)$.
Then $\rho_g$ is RHM and $\lambda_g$ is LHM for each $g \in G$.

Proof. Since $\lambda_g(hB) = \lambda(hBg) = \lambda(Bg) = \lambda_g(B)$, we have that $\lambda_g$ is
LHM, and a similar argument works for $\rho_g$. □

By Lemma 6.48, $\lambda_g$ is a multiple of $\lambda$. Let the multiple be $c_g$. Similarly,
by Corollary 6.52, $\rho_g$ is a multiple of $\rho$, so define $c'_g$ by $\rho_g(B) = c'_g\rho(B)$. In
abelian groups, $c_g = c'_g = 1$ for all $g$, if $\lambda$ and $\rho$ are related. We introduce $\rho_g$
because it will play an important role in the proof that Pitman's estimator
is the formal Bayes rule with respect to an invariant measure.

We obtain interesting results if we replace $\lambda$ by $\rho$ in Lemma 6.47 and
replace $\rho$ by $\lambda$ in Proposition 6.53.

Lemma 6.55.¹⁵ If $\rho$ is RHM and $g \in G$ and $f$ is integrable with respect
to $\rho$, then

$$\int f(g \circ h)\,d\rho(h) = c'_{g^{-1}} \int f(h)\,d\rho(h).$$

If $\lambda$ is LHM and $g \in G$ and $f$ is integrable with respect to $\lambda$, then

$$\int f(h \circ g)\,d\lambda(h) = c_{g^{-1}} \int f(h)\,d\lambda(h).$$

Proof. In a manner similar to the proof of Lemma 6.47, we can prove that

$$\int f(h)\,d\rho_g(h) = \int f(g^{-1} \circ h)\,d\rho(h),$$

for all $g \in G$. It follows that

$$\int f(g \circ h)\,d\rho(h) = \int f(h)\,d\rho_{g^{-1}}(h) = c'_{g^{-1}} \int f(h)\,d\rho(h).$$

The proof of the other part is virtually identical, using Proposition 6.53. □

Actually, the numbers $c_g$ and $c'_g$ are related.

¹³ This proposition is used in the proofs of Lemmas 6.55, 6.56, 6.65, 6.66 and
Theorem 6.74.
¹⁴ This lemma is used in the proof of Lemma 6.68.
¹⁵ This lemma is used in the proofs of Lemmas 6.56, 6.65, and 6.66 and Theorem 6.74.

Lemma 6.56.¹⁶ If $\lambda$ and $\rho$ are related, then $c_g = 1/c'_g = c'_{g^{-1}}$ for all
$g \in G$. Also $c'_g = c_{g^{-1}}$.

Proof. Let $f$ be integrable with respect to $\rho$. Use Lemma 6.55 twice to
write

$$\int f(h)\,d\rho(h) = \int f(g \circ g^{-1} \circ h)\,d\rho(h) = c'_g \int f(g \circ h)\,d\rho(h)
= c'_g c'_{g^{-1}} \int f(h)\,d\rho(h),$$

from which it follows that $c'_{g^{-1}} = 1/c'_g$. Next, use what we just proved and
Proposition 6.53 and Lemma 6.55 to show that

$$\int f(h^{-1})\,d\lambda(h) = \int f(h)\,d\rho(h) = c'_g \int f(g \circ h)\,d\rho(h)
= c'_g \int f(g \circ h^{-1})\,d\lambda(h) = c'_g \int f([h \circ g^{-1}]^{-1})\,d\lambda(h)
= c'_g c_g \int f(h^{-1})\,d\lambda(h),$$

from which it follows that $c_g = 1/c'_g$. Then $c'_g = c_{g^{-1}}$ follows trivially. □

Example 6.57. Consider Group 3, so that $\rho(gB) = \int_{gB} (1/y)\,dx\,dy$. Let $g = (g_1, g_2)$
and transform by $y = g_2 w$ and $x = g_2 z + g_1$. Then $J = g_2^2$ and

$$\int_{gB} \frac{1}{y}\,dx\,dy = \int_B \frac{g_2^2}{g_2 w}\,dz\,dw = g_2 \int_B \frac{1}{w}\,dz\,dw = g_2\rho(B),$$

so $c'_g = g_2$.


¹⁶ This lemma is used in the proof of Lemma 6.65.

There is a large class of examples in which the two groups of transformations
$G$ and $\bar{G}$ are isomorphic and are similar to both the parameter space
and part of the sample space. We make this precise with the following
condition.

Assumption 6.58. Assume the following conditions:

• The distributions $\{P_\theta : \theta \in \Omega\}$ are invariant under the actions of
groups $G$ and $\bar{G}$.

• LHM $\lambda$ and related RHM $\rho$ on $G$ exist.

• The conditions of Lemma 6.48 hold for $G$, so that LHM and RHM
are essentially unique.

• The mapping $\phi : G \to \bar{G}$ defined by $\phi(g) = \bar{g}$ is a group isomorphism.

• There is a bimeasurable mapping $\eta : \Omega \to \bar{G}$ which satisfies $\bar{g} \circ \eta(\theta) = \eta(\bar{g}\theta)$,
for every $g \in G$ and $\theta \in \Omega$.

• There exists a bimeasurable function $t : \mathcal{X} \to G \times \mathcal{Y}$ for some space
$\mathcal{Y}$ (where we write $t(X) = (H, Y)$), such that, for every $g \in G$ and
$x \in \mathcal{X}$, $t(gx) = (g \circ h, y)$ if $t(x) = (h, y)$.

• For every $\theta$, the distribution on $G \times \mathcal{Y}$ induced from $P_\theta$ by $t$ has a
density with respect to $\lambda \times \nu$, where $\nu$ is some measure on $\mathcal{Y}$.

Note that the $Y$ part of $t(X) = (H, Y)$ is invariant when Assumption 6.58
holds. Also, since the function $t : \mathcal{X} \to G \times \mathcal{Y}$ is bimeasurable, it will be
convenient to assume that $\mathcal{X} = G \times \mathcal{Y}$ and $X = (H, Y)$. Since $t$ is one-to-one,
the posterior of $\Theta$ given $t(X) = (h, y)$ will be the same as the posterior given
$X = t^{-1}(h, y)$. Similarly, we will let $P_\theta$ stand for the induced distribution
on $G \times \mathcal{Y}$, and we will let $f_{X|\Theta}(h, y|\theta)$ be the Radon-Nikodym derivative
of $P_\theta$ with respect to $\lambda \times \nu$.

Theorem 6.59. Assume Assumption 6.58. Let $\aleph$ be an action space and
$L : \Omega \times \aleph \to \mathbb{R}$ be a loss function. Let $\tilde{G}$ be a group of transformations of
$\aleph$ such that $L$ is invariant. Then, if the formal Bayes rule with respect to $\rho$
exists, it is the MRE rule, it is MRE conditional on $Y$, and $Y$ is ancillary.

Before proving Theorem 6.59, here are some examples. 

Example 6.60. As we mentioned earlier, Pitman's estimator is MRE and it is
the formal Bayes rule with respect to RHM on the location Group 1. Here $G = \mathbb{R}$
and we can map $\mathcal{X}$ to $G \times \mathcal{Y}$ if we let $\mathcal{Y} = \mathbb{R}^{n-1}$ and $t(x_1, \ldots, x_n) = (x_n, y)$ where
$y = (x_1 - x_n, \ldots, x_{n-1} - x_n)$. Then $t(gx) = (gx_n, y)$. The loss $L(\theta, a) = (\theta - a)^2$
is invariant.

In the scale version, Group 2, $G = \mathbb{R}^+$ and we can let $\mathcal{Y} = \mathbb{R}^{n-1} \times \{-1, 1\}$
and $h = |x_n|$ so that $t(x) = (|x_n|, y)$, where $y = (x_1/|x_n|, \ldots, x_n/|x_n|)$. Then
$t(gx) = (g|x_n|, y)$. RHM is $dx/x$. If we use the invariant loss $L(\theta, a) = (\theta - a)^2/\theta^2$,
then Theorem 6.18 says that the MRE decision is indeed the formal
Bayes rule with respect to RHM. Theorem 6.59 will also apply if the loss is
$L(\theta, a) = [\log(\theta/a)]^2$.

With Group 3, we can write $t(x) = (x_{(n)}, x_{(n)} - x_{(1)}, y)$, where $x_{(i)}$ is the $i$th
ordered element of $x$ and

$$y = \left(\frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}, \ldots, \frac{x_{(n-1)} - x_{(1)}}{x_{(n)} - x_{(1)}}, \pi\right),$$

where $\pi$ is the permutation required to return the order statistic to the original
data. Here $\mathcal{Y} = \mathbb{R}^{n-2} \times \Pi$, where $\Pi$ is the set of permutations, and $G = \mathbb{R} \times \mathbb{R}^+$.
Then $t(gx) = (g(x_{(n)}, x_{(n)} - x_{(1)}), y)$. There are several invariant losses. Here are
three:

Ll {0,a) = N = ]R; 

L 3 (.,a) = K = ]RxIR + . 



The first is for location estimation, the second is for scale estimation, and the third
is for simultaneous estimation of both. RHM has Radon-Nikodym derivative $1/\sigma$
with respect to Lebesgue measure. We can explicitly work through the normal
distribution case. The likelihood function based on $n$ observations is

$$f_{X|\Theta}(x|\mu,\sigma) = (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\left[w + n(\bar{x} - \mu)^2\right]\right),$$

where $w = \sum_{i=1}^n (x_i - \bar{x})^2$. To find the posterior, we multiply by $1/\sigma$ and find the
appropriate constant. If we let $\tau = \sigma^{-2}$, then $\sigma = \tau^{-1/2}$ and $d\sigma = -\tau^{-3/2}\,d\tau/2$.
The posterior is, for some constant $c$,

$$c\,\tau^{(n-2)/2}\exp\left(-\frac{\tau}{2}\left[w + n(\mu - \bar{x})^2\right]\right).$$

This has the form of the product of an $N(\bar{x}, 1/(n\tau))$ density times a
$\Gamma([n - 1]/2, w/2)$ density. The posterior distribution of

$$\sqrt{n}\,\frac{\Theta_1 - \bar{X}}{\Theta_2} \qquad (6.61)$$

is $N(0, 1)$ given $\Theta_2$. But since this distribution does not depend on $\Theta_2$, it is also
the marginal posterior. Also, the posterior distribution of $W/\Theta_2^2$ is $\chi^2_{n-1}$, where
$W = \sum_{i=1}^n (X_i - \bar{X})^2$. These distributions parallel the prior conditional distributions
of the sufficient statistics given $\Theta$. That is, prior to seeing the data and
conditional on $\Theta$, (6.61) has $N(0,1)$ distribution and $W/\Theta_2^2 \sim \chi^2_{n-1}$. The posterior
distributions were named fiducial distributions by Fisher (1935), because
they seem to fall right out of the conditional distributions given $\Theta$ without any
need for a prior on $\Theta$. The quantity in (6.61) and $W/\Theta_2^2$ are called pivotal quantities.
These will be special cases of a more general result that will come later
(Corollary 6.67).
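
The posterior described in Example 6.60 for the normal case is simple to sample from, which gives a quick check on the pivotal quantities. The following is a minimal sketch under stated assumptions: the data vector, the seed, and the name `sample_posterior` are hypothetical, and the code relies on the fact that Python's `random.gammavariate(alpha, beta)` takes a shape `alpha` and a scale `beta`, so the rate $w/2$ in $\Gamma([n-1]/2, w/2)$ enters as the scale $2/w$.

```python
import math
import random

def sample_posterior(x, size, rng):
    # Draw (mu, sigma) from the posterior under the right Haar prior 1/sigma:
    # tau = sigma^{-2} ~ Gamma((n-1)/2, rate w/2), then mu | tau ~ N(xbar, 1/(n*tau)).
    n = len(x)
    xbar = sum(x) / n
    w = sum((xi - xbar) ** 2 for xi in x)
    draws = []
    for _ in range(size):
        tau = rng.gammavariate((n - 1) / 2.0, 2.0 / w)   # shape, scale = 1/rate
        mu = rng.gauss(xbar, math.sqrt(1.0 / (n * tau)))
        draws.append((mu, 1.0 / math.sqrt(tau)))
    return draws

x = [4.1, 5.3, 2.8, 6.0, 4.7, 3.9, 5.5, 4.4]   # hypothetical data
n = len(x)
xbar = sum(x) / n
w = sum((xi - xbar) ** 2 for xi in x)

rng = random.Random(1)
draws = sample_posterior(x, 20000, rng)

# Pivotal quantities evaluated at the posterior draws.
pivot_z = [math.sqrt(n) * (mu - xbar) / sigma for mu, sigma in draws]
pivot_q = [w / sigma ** 2 for _, sigma in draws]

mean_z = sum(pivot_z) / len(pivot_z)
var_z = sum((z - mean_z) ** 2 for z in pivot_z) / len(pivot_z)
mean_q = sum(pivot_q) / len(pivot_q)

print(mean_z, var_z)   # should be near 0 and 1
print(mean_q)          # should be near n - 1
```

The sample moments are consistent with (6.61) being $N(0,1)$ and $W/\Theta_2^2$ being $\chi^2_{n-1}$ under the posterior; a simulation of this size cannot of course establish the distributions exactly.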

The proof of Theorem 6.59 will proceed through a series of lemmas.

Lemma 6.62. Assume Assumption 6.58. For every $\theta \in \Omega$ and every $g \in G$,

$$f_{X|\Theta}(h,y|\theta) = f_{X|\Theta}(g \circ h, y|\bar{g}\theta), \quad \text{a.e. } [\lambda \times \nu]. \qquad (6.63)$$

Proof. Let $B$ be an arbitrary measurable subset of $\mathcal{X}$. Since $P'_\theta(X \in B) = P'_{\bar{g}\theta}(X \in gB)$, we
have, for every $g \in G$ and every $\theta \in \Omega$,

$$\begin{aligned}
&\int\!\!\int I_B(h,y)\,f_{X|\Theta}(h,y|\theta)\,d\lambda(h)\,d\nu(y) \\
&\quad= \int\!\!\int I_{gB}(h,y)\,f_{X|\Theta}(h,y|\bar{g}\theta)\,d\lambda(h)\,d\nu(y) \\
&\quad= \int\!\!\int I_B(g^{-1} \circ h,y)\,f_{X|\Theta}(h,y|\bar{g}\theta)\,d\lambda(h)\,d\nu(y) \\
&\quad= \int\!\!\int I_B(h,y)\,f_{X|\Theta}(g \circ h,y|\bar{g}\theta)\,d\lambda(h)\,d\nu(y),
\end{aligned}$$

where the last equality follows from Lemma 6.47. Since this is true for all
$B$, the integrands of the first and last lines must be equal a.e. $[\lambda \times \nu]$.
This immediately implies (6.63). □

A simple corollary to this result is obtained by letting $g = \phi^{-1}(\eta(\theta)^{-1})$,
where $\phi$ and $\eta$ are defined in Assumption 6.58.

Corollary 6.64. Assume Assumption 6.58. There exists a function $r :
G \times \mathcal{Y} \to \mathbb{R}$ such that, for every $\theta \in \Omega$,

$$f_{X|\Theta}(h,y|\theta) = r(\phi^{-1}(\eta(\theta)^{-1}) \circ h, y), \quad \text{a.e. } [\lambda \times \nu].$$

The formula given in the statement of Corollary 6.64 is particularly cumbersome
due to the use of the notation $\phi^{-1}(\eta(\cdot)^{-1})$. In fact, some of the
proofs below would be almost unreadable if we continued to use this notation
for the sake of mathematical precision. For this reason, we will take
the following liberty with the notation for the remainder of the proof of
Theorem 6.59. We will pretend that $\Omega = \bar{G} = G$ so that $\phi$ and $\eta$ are just
identity transformations, and we will not have to put the bar over elements
of $\bar{G}$. This should not cause any confusion, since the sets really do behave
virtually identically. For example, Corollary 6.64 now says

$$f_{X|\Theta}(h,y|\theta) = r(\theta^{-1} \circ h, y), \quad \text{a.e. } [\lambda \times \nu].$$

The following lemma will be useful both here and later.

Lemma 6.65.¹⁷ Under Assumption 6.58, $Y$ is ancillary and the posterior
density of $\Theta$ with respect to RHM is

$$f_{\Theta|X}(\psi|h,y) = c_h f_{H|Y,\Theta}(h|y,\psi),$$

where the second factor on the right is the conditional density of $H$ given
$Y = y$ and $\Theta = \psi$.

¹⁷ This lemma is used in the proofs of Lemma 6.66 and Theorem 6.74.

Proof. Since $\Omega = G$, there is only one orbit in the parameter space, hence
the maximal invariant is constant. Since $Y$ is invariant, Lemma 6.39 part
4 shows that $Y$ is ancillary.






where the second and fifth equalities follow from Corollary 6.64, the third follows from Proposition 6.53, the fourth follows from Lemma 6.55, and the sixth follows from the fact that $Y$ is ancillary and from Lemma 6.56. The posterior density of $\Theta$ given $X = (h,y)$ with respect to $\rho$ is calculated via Bayes' theorem 1.31:

$$f_{\Theta|X}(\psi \mid h,y) = \frac{f_{X|\Theta}(h,y \mid \psi)}{f_X(h,y)} = \frac{f_{X|\Theta}(h,y \mid \psi)}{c_h^{-1} f_Y(y)} = c_h\, f_{H|Y,\Theta}(h \mid y,\psi). \qquad \Box$$



Lemma 6.66. Assume the conditions of Theorem 6.59. If $\eta$ is an equivariant rule, then the conditional risk function given $Y = y$ (constant as a function of $\theta$) equals the posterior risk given $X = (h,y)$ (constant as a function of $h$).

Proof. Since $\Omega$ is isomorphic to $G$, there is only one orbit and the risk function will be constant for every equivariant rule by Lemma 6.39, part 4. Also, the conditional risk function given $Y$ will be constant in $\theta$. The posterior risk given $X = (h',y)$ is

$$\begin{aligned}
\int L(\theta, \eta(h',y))\, f_{\Theta|X}(\theta \mid h',y)\, d\rho(\theta)
&= c_{h'} \int L(\theta, h'\eta(e,y))\, f_{H|\Theta,Y}(h' \mid \theta, y)\, d\rho(\theta) \\
&= \frac{c_{h'}}{f_Y(y)} \int L(h'^{-1}\theta, \eta(e,y))\, f_{X|\Theta}(h',y \mid \theta)\, d\rho(\theta) \\
&= \frac{c_{h'}}{f_Y(y)} \int L(h'^{-1}\theta, \eta(e,y))\, r(\theta^{-1} \circ h', y)\, d\rho(\theta) \\
&= \frac{c_{h'}}{f_Y(y)} \int L(h'^{-1}\theta, \eta(e,y))\, r([h'^{-1} \circ \theta]^{-1}, y)\, d\rho(\theta) \\
&= \frac{1}{f_Y(y)} \int L(\theta, \eta(e,y))\, r(\theta^{-1}, y)\, d\rho(\theta) \\
&= \frac{1}{f_Y(y)} \int L(\theta^{-1}, \eta(e,y))\, r(\theta, y)\, d\lambda(\theta) \\
&= \frac{1}{f_Y(y)} \int L(h^{-1}, \eta(e,y))\, f_{X|\Theta}(h,y \mid e)\, d\lambda(h) \\
&= \int L(e, \eta(h,y))\, f_{H|Y,\Theta}(h \mid y, e)\, d\lambda(h) = R(e, \eta \mid y),
\end{aligned}$$

where the first equality follows from Lemma 6.65 and equivariance of $\eta$, the second and eighth follow from invariance of $L$ and the definition of conditional density, the third and seventh follow from Corollary 6.64, the fourth is elementary group theory, the fifth follows from Lemma 6.55 and Lemma 6.56, the sixth follows from Proposition 6.53, and the ninth follows from the definition of conditional risk function. □






There is a useful corollary to Lemma 6.66. 

Corollary 6.67. Under Assumption 6.58, the conditional distribution of $\Theta^{-1}H$ given $Y = y$ is the same as the posterior distribution of $\Theta^{-1}H$.

Proof. Let $\aleph = G$, and for each measurable $B \subseteq G$, let $L(\theta, a) = I_B(\theta^{-1}a)$ in Lemma 6.66. The conclusion is that $P'_\theta(\Theta^{-1}H \in B \mid Y = y) = \Pr(\Theta^{-1}H \in B \mid (H,Y) = (h,y))$. □

The quantity $\Theta^{-1}H$ is called a pivotal quantity because we can switch back and forth between thinking of $H$ or $\Theta$ as being the random variable and the other as fixed without changing the distribution. The common distribution is called the fiducial distribution by Fisher (1935).
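The switch between "random $H$, fixed $\Theta$" and "fixed $H$, random $\Theta$" can be checked numerically in the simplest case. The sketch below is an illustration only and assumes a setting not spelled out in the text: the location group on the real line, $H = \bar{X}$ for $n$ IID $N(\theta, 1)$ observations (so $\Theta^{-1}H$ corresponds to $\bar{X} - \Theta$ and, since $\bar{X}$ is independent of the ancillary, conditioning on $Y$ can be dropped), and Lebesgue measure as the Haar prior. The function names are hypothetical.

```python
from statistics import NormalDist


def sampling_prob(c, n):
    """P(Xbar - theta <= c | Theta = theta): theta fixed, Xbar ~ N(theta, 1/n)."""
    return NormalDist(mu=0.0, sigma=n ** -0.5).cdf(c)


def posterior_prob(c, xbar, n):
    """Pr(xbar - Theta <= c | Xbar = xbar) under a flat (Haar) prior.

    Assumed setting: the posterior of Theta is N(xbar, 1/n), and the event
    {xbar - Theta <= c} is the same as {Theta >= xbar - c}.
    """
    post = NormalDist(mu=xbar, sigma=n ** -0.5)
    return 1.0 - post.cdf(xbar - c)


n, c = 10, 0.3
# The two probabilities agree, whatever value of xbar was observed.
print(sampling_prob(c, n), posterior_prob(c, xbar=2.7, n=n))
```

Under these assumptions the common value does not depend on the observed $\bar{x}$, which is what makes the difference $\bar{X} - \Theta$ pivotal.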

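Lemma 6.68. Assume the conditions of Theorem 6.59. Assume that the formal Bayes rule with respect to $\rho$ exists. Let $d(y)$ minimize the posterior risk if $(e,y)$ is observed, where $e$ is the identity in $G$. That is, $\min_a \int_\Omega L(\theta, a)\, f_{\Theta|X}(\theta \mid e, y)\, d\rho(\theta)$ occurs at $a = d(y)$. Define $\delta(h,y) = h\,d(y)$. Then $\delta$ is the formal Bayes rule, and it is equivariant.

Proof. First, note that $\delta$ is equivariant since, for $g \in G$,

$$\delta(g(h,y)) = \delta(g \circ h, y) = (g \circ h)\, d(y) = g h\, d(y) = g\, \delta(h,y).$$

To see that $\delta$ is the formal Bayes rule, assume that $(h,y)$ is observed. We must show that $\min_a \int_\Omega L(\theta, a)\, f_{\Theta|X}(\theta \mid h,y)\, d\rho(\theta)$ occurs at $a = \delta(h,y)$. We can write

$$\begin{aligned}
\int_\Omega L(\theta, a)\, f_{\Theta|X}(\theta \mid h,y)\, d\rho(\theta)
&= c_h \int_\Omega L(\theta, a)\, f_{H|Y,\Theta}(h \mid y, \theta)\, d\rho(\theta) \\
&= \frac{c_h}{f_Y(y)} \int_\Omega L(h^{-1}\theta, h^{-1}a)\, r(\theta^{-1}h, y)\, d\rho(\theta) \\
&= \frac{c_h}{f_Y(y)} \int_\Omega L(h^{-1}\theta, h^{-1}a)\, r([h^{-1}\theta]^{-1}, y)\, d\rho(\theta) \\
&= \frac{1}{f_Y(y)} \int_\Omega L(\theta, h^{-1}a)\, r(\theta^{-1}, y)\, d\rho(\theta) \\
&= \frac{1}{f_Y(y)} \int_\Omega L(\theta, h^{-1}a)\, f_{X|\Theta}(e, y \mid \theta)\, d\rho(\theta) \\
&= \int_\Omega L(\theta, h^{-1}a)\, f_{\Theta|X}(\theta \mid e, y)\, d\rho(\theta),
\end{aligned}$$

where the second equality uses Corollary 6.64 and the invariance of the loss function, the fourth equality follows from Lemma 6.54, the fifth follows from Corollary 6.64, and the sixth uses the fact that $c_e = 1$. Using the definition of $d$, the last integral above is minimized when $h^{-1}a = d(y)$, that is, when $a = h\,d(y) = \delta(h,y)$. □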






Lemma 6.69. Assume the conditions of Theorem 6.59. The $\delta$ defined in Lemma 6.68 is MRE and MRE conditional on $Y$.

Proof. From Lemma 6.68, we know that the posterior risk given $X = (e,y)$ is minimized for each $y$ at the action $d(y)$. By equivariance, and the fact that the posterior risk given $X = (h,y)$ is constant in $h$ (Lemma 6.66), it follows that the posterior risk is minimized at the action $\delta(x)$. Since Lemma 6.66 also shows that the conditional risk function equals the posterior risk, the conditional risk function is also minimized at $\delta$; hence, $\delta$ is the MRE rule conditional on $Y = y$. The unconditional risk function of a rule $\eta$ at $\theta = e$ is

$$R(e, \eta) = \int R(e, \eta \mid y)\, f_Y(y)\, d\nu(y).$$

Since $\delta$ has minimum conditional risk function uniformly in $y$, the unconditional risk function of $\delta$ is clearly the minimum also. Hence $\delta$ is also the MRE rule. □

The conditions of Theorem 6.59 are often met when $\mathcal{Y}$ is the space of maximal invariants.

Example 6.70 (Continuation of Example 6.60; see page 369). Suppose that $X_1, \ldots, X_n$ are IID given $\Theta = \theta$, each with density $f([x - \theta_1]/\theta_2)/\theta_2$, for some density $f(\cdot)$. These distributions are invariant under Group 3. There are many possible invariant losses, as we saw earlier. The $\mathcal{Y}$ we calculated earlier was the space of maximal invariants. The MRE for loss $L_1$ is



If / is the standard normal density, then tfi(x) = x and 



6i{x) = 
The MRE for loss L 2 is 

6 2 (x) = 



r(f) rw 

It may be the case that all equivariant rules are inadmissible. For example, with Group 4 and one $n$-dimensional normal observation $X$, the MRE rule is to estimate $\Theta$ by $X$. But we saw in Section 3.2.3 that this is inadmissible if $n \ge 3$.
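The inadmissibility claim can be illustrated by simulation. The sketch below is not taken from Section 3.2.3; it assumes that $X \sim N_n(\theta, I)$ with squared-error loss and that the dominating rule is the James-Stein estimator $(1 - (n-2)/\|X\|^2)X$. The function name, the number of replications, and the seed are hypothetical choices.

```python
import random


def estimated_risks(theta, reps=20000, seed=0):
    """Monte Carlo risks of X and of the James-Stein rule under squared error.

    Assumption: X ~ N_n(theta, I). Returns (risk of X, risk of James-Stein).
    """
    rng = random.Random(seed)
    n = len(theta)
    loss_x = 0.0
    loss_js = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)
        shrink = 1.0 - (n - 2) / norm2      # James-Stein shrinkage factor
        js = [shrink * v for v in x]
        loss_x += sum((a - t) ** 2 for a, t in zip(x, theta))
        loss_js += sum((a - t) ** 2 for a, t in zip(js, theta))
    return loss_x / reps, loss_js / reps


risk_x, risk_js = estimated_risks([0.0] * 5)
print(risk_x, risk_js)   # risk of X is near n = 5; James-Stein is smaller
```

At $\theta = 0$ with $n = 5$ the exact risks are 5 and 2, so the gap is large; away from the origin the improvement shrinks but, in this assumed setting, never reverses.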

In later sections, we will see Theorems 6.74 and 6.78, which are like 
Theorem 6.59. The conclusions to those theorems say that certain formal 
Bayes inferences with respect to RHM priors agree with classical inferences 
conditional on the ancillary Y. This is why, in Theorem 6.59, we also showed 






that the MRE decision rule is MRE conditional on Y. Theorems 6.59, 6.74, 
and 6.78 parallel each other more this way. 

Sometimes, the conclusions of Theorem 6.59 hold even when its condi- 
tions are not strictly met. For example, suppose that there is a nuisance 
parameter. It may be the case that for each fixed value of the nuisance 
parameter, the conditions of Theorem 6.59 apply to the problem with the 
appropriate subparameter space. 

Example 6.71. Suppose that $X_1, \ldots, X_n$ are conditionally independent with $N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. Let $\aleph = \mathbb{R}$ and $L(\theta, a) = (\mu - a)^2$. This loss is not invariant under Group 3; however, it is invariant under Group 1. But the parameter space is not isomorphic to Group 1. For each value of $\sigma$, consider the subparameter space $\Omega_\sigma = \{(\mu, \sigma) : \mu \in \mathbb{R}\}$. The formal Bayes rule with respect to RHM on $\Omega_\sigma$ is $\delta(x) = \bar{x}$ for each $\sigma$. Since $\delta$ is the MRE rule for each $\sigma$, it is the MRE rule under Group 1 for the original problem.

It is not difficult to show that the situation of Example 6.71 generalizes 
to the following result. 

Proposition 6.72. Suppose that the parameter space is $\Omega = \Omega_1 \times \Omega_2$. Suppose that, for each $\theta_2 \in \Omega_2$, the conditions of Theorem 6.59 hold when $\Omega_1 \times \{\theta_2\}$ is taken as the parameter space. Then $\delta$ is the MRE rule if and only if it is MRE for each of the subproblems with fixed values of $\theta_2$.

There are situations in which there is no MRE. 

Example 6.73 (Continuation of Example 6.71). This time, let $\aleph = [0, \infty)$ and $L(\theta, a) = (\sigma^2 - a)^2/\sigma^4$. Let the group be Group 2. For each value of $\mu$, consider the subparameter space $\Omega_\mu = \{(\mu, \sigma) : \sigma > 0\}$. The formal Bayes rule with respect to RHM on $\Omega_\mu$ is $\delta(x) = \sum_{i=1}^n (x_i - \mu)^2/(n + 2)$. No single equivariant rule achieves the minimum risk for each $\mu$.
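The divisor $n + 2$ can be checked by a short calculation. For fixed $\mu$, write $S = \sum_i (X_i - \mu)^2$, so that $S/\sigma^2$ has a chi-squared distribution with $n$ degrees of freedom, and consider rules of the form $cS$. The sketch below is an illustration with hypothetical function names; it uses exact rational arithmetic and the moments $E[\chi^2_n] = n$ and $E[(\chi^2_n)^2] = n(n+2)$.

```python
from fractions import Fraction


def scaled_risk(c, n):
    """E[(sigma^2 - c*S)^2] / sigma^4 when S / sigma^2 ~ chi-squared(n).

    Expanding the square gives 1 - 2*c*n + c^2 * n * (n + 2).
    """
    c = Fraction(c)
    return 1 - 2 * c * n + c * c * n * (n + 2)


def best_multiplier(n):
    """Minimizer in c of the quadratic scaled_risk(c, n)."""
    return Fraction(1, n + 2)


n = 8
c_star = best_multiplier(n)
print(c_star, scaled_risk(c_star, n))        # 1/10 1/5
print(scaled_risk(Fraction(1, n), n))        # 1/4, the risk of S/n
```

Because the optimal rule is built from $S$, which itself depends on $\mu$, the minimizing rule changes with $\mu$; this is the sense in which no single rule is best for every $\mu$.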



6.3 Testing and Confidence Intervals* 
6.3.1 P- Values in Invariant Problems 

In Section 4.6, we introduced P-values as an alternative to testing hypothe- 
ses at preassigned levels. In Examples 4.146 (page 281) and 4.61 (page 241) 
we saw that sometimes the P-value relative to a collection of tests is the 
same as the posterior probability that the hypothesis is true based on an 
improper prior. A more general situation in which P-values correspond to posterior probabilities with improper priors arises when there is equivariance with respect to some group operating on the data and parameter spaces. The structure of the problem will need to be very much like that
of Theorem 6.59. In addition, we will need to say something about the hy- 
potheses of interest and how they interact with the group operation. We 



*This section may be skipped without interrupting the flow of ideas.




also need to choose an appropriate set of tests with respect to which we 
calculate the P- value. 

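Theorem 6.74. Assume Assumption 6.58 (see page 368). For each $\theta \in \Omega$, let $\Omega_\theta$ be a subset of $\Omega$ such that the following conditions hold:

1. $\theta \in \Omega_\theta$;

2. for all $g \in G$ and all $\theta \in \Omega$, $\bar{g}\Omega_\theta = \Omega_{\bar{g}\theta}$;

3. for all $\theta \in \Omega$ and all $\psi \in \Omega_\theta$, $\Omega_\psi \subseteq \Omega_\theta$;

4. for all $\theta \in \Omega$ and all $h \in \Omega_e$ (where $e$ is the identity in $\bar{G}$), $\Omega_\theta h \subseteq \Omega_\theta$.

For each $\theta \in \Omega$, let $G$ index a set of tests $\{\phi_{\theta,g} : g \in G\}$ of the hypothesis $H_\theta : \Theta \in \Omega_\theta$ versus $A_\theta : \Theta \notin \Omega_\theta$ defined by



Suppose that we use $\rho$ as a (possibly improper) prior for $\Theta$. The posterior probability that $H_\theta$ is true given $t(X) = (h,y)$ is equal to the conditional P-value given $Y = y$ relative to the set of tests $\{\phi_{\theta,g} : g \in G\}$.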

It should be noted, in the statement of Theorem 6.74, that the P-values 
must be calculated conditional on Y . But, Lemma 6.65 says that Y is ancil- 
lary. So, those who believe in conditioning on ancillaries would then want 
to calculate P-values conditional on Y anyway. Theorem 2.48 says that if 
there is a boundedly complete sufficient statistic, it will be independent of 
the ancillary. This leads to a simpler version of Theorem 6.74. 

Corollary 6.75. Under the conditions of Theorem 6.74, if $H$ (the "group" part of $t(X)$) is a boundedly complete sufficient statistic, then the posterior probability that $H_\theta$ is true equals the P-value relative to the set of tests $\{\phi_{\theta,g} : g \in G\}$.

Before proving Theorem 6.74, some explanation of the four conditions on $\Omega_\theta$ is in order. The first condition is simply to connect $\theta$ with the corresponding hypothesis in a sensible way. The second condition ensures that the hypotheses are "equivariant" in some sense. The third condition is to ensure that the P-value is the size of the test $\phi_{\theta,g}$ when $H = g$. The fourth condition guarantees that the size of $\phi_{\theta,g}$ as a test of $H_\theta$ is equal to its power at $\theta$. These last two conditions also capture the "one-sided" nature of the types of hypotheses to which this theorem applies. It will not apply to point hypotheses or to hypotheses such that $\Omega_\theta$ has smaller dimension than $\Omega$. The reason that the form of the test must be tied so closely to the form of the hypotheses is that there may be many classes of "equivariant" tests^18 and each class may lead to a different P-value. However, there is only one posterior probability that $\Theta \in \Omega_\theta$ with respect to RHM. Hence, we needed to identify exactly which class of tests has P-value equal to that posterior probability. This point will become clearer after Theorem 6.78.

As in the proof of Theorem 6.59, we will assume that $X = t(X)$ and that $G = \bar{G} = \Omega$, to make the notation simpler. The following lemma is also useful.

Lemma 6.76. Under Assumption 6.58, the conditional distributions of the $H$ part of $X$ given $Y$ are invariant.

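Proof. Let $B$ be a measurable subset of $G$, let $g \in G$, and let $\mu$ be the probability measure that gives the marginal distribution of $Y$. Define

$$v(\theta, g, y) = P'_\theta(gH \in B \mid Y = y).$$

We want to prove that $P'_{g\theta}(H \in B \mid Y = y) = v(\theta, g, y)$, a.s. $[\mu]$ (for fixed $\theta$ and $g$). For every measurable subset $A$ of $\mathcal{Y}$,

$$\begin{aligned}
\int_A v(\theta, g, y)\, d\mu(y) &= P'_\theta(Y \in A,\ gH \in B) \\
&= P'_{g\theta}(Y \in A,\ H \in B) \\
&= \int_A P'_{g\theta}(H \in B \mid Y = y)\, d\mu(y). \qquad \Box
\end{aligned}$$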

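Proof of Theorem 6.74. Let $\theta \in \Omega$ and let $\psi \in \Omega_\theta$. Then $\theta^{-1}\psi \in \Omega_e$ by condition 2. Also, we use conditions 2 and 4 to show that

$$\begin{aligned}
P'_\psi(\phi_{\theta,g}(H,Y) = 1 \mid Y = y) &= P'_\psi(H \in \theta\Omega_{g^{-1}}^{-1} \mid Y = y) \\
&= P'_\theta(\psi\theta^{-1}H \in \theta\Omega_{g^{-1}}^{-1} \mid Y = y) = P'_\theta(H \in \theta\psi^{-1}\theta\Omega_{g^{-1}}^{-1} \mid Y = y) \\
&= P'_\theta(H^{-1} \in \Omega_{g^{-1}}\theta^{-1}\psi\theta^{-1} \mid Y = y) \le P'_\theta(H^{-1} \in \Omega_{g^{-1}}\theta^{-1} \mid Y = y) \\
&= P'_\theta(\phi_{\theta,g}(H,Y) = 1 \mid Y = y).
\end{aligned}$$

This shows that the conditional size of the test $\phi_{\theta,g}$ (given $Y$) as a test of $H_\theta$ is equal to its conditional power function at $\theta$. For each $g \in G$, define

$$Q(g,y) = P'_e(H \in \Omega_{g^{-1}}^{-1} \mid Y = y).$$

Then $P'_\theta(H \in \theta\Omega_{g^{-1}}^{-1} \mid Y = y) = Q(g,y)$, and it follows from what we just proved that the conditional size, given $Y = y$, of $\phi_{\theta,g}$ as a test of $H_\theta$ equals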

18 We put the word "equivariant" in quotes because it is not the test function itself that is equivariant, but rather the combination of the hypothesis and the test function. That is, if $\psi_\theta$ is a test of $\Omega_\theta$, then $\psi_\theta(h,y) = \psi_{g\theta}(gh, y)$.




$Q(g,y)$. The conditional P-value given $Y = y$ can then be calculated as

$$p(h,y) = \inf\{Q(g,y) : \phi_{\theta,g}(h,y) = 1\}.$$

It is easy to see that $\phi_{\theta,\theta^{-1}h}(h,y) = 1$ by condition 1. It follows that $p(h,y) \le Q(\theta^{-1}h, y)$. Next, suppose that $\phi_{\theta,g}(h,y) = 1$. It follows that $h \in \theta\Omega_{g^{-1}}^{-1}$, hence $h^{-1}\theta \in \Omega_{g^{-1}}$. Condition 3 implies that $\Omega_{h^{-1}\theta} \subseteq \Omega_{g^{-1}}$, from which it follows that $Q(\theta^{-1}h, y) \le Q(g,y)$. It follows that $Q(\theta^{-1}h, y) \le p(h,y)$, hence $Q(\theta^{-1}h, y) = p(h,y)$.

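To complete the proof, we calculate the posterior probability that $H_\theta$ is true given $X = (h,y)$ and show that it equals $Q(\theta^{-1}h, y)$. Lemma 6.65 tells us that

$$f_{\Theta|X}(\psi \mid h,y) = c_h\, f_{H|Y,\Theta}(h \mid y, \psi).$$

In the following equalities, let $(H', Y')$ have the same conditional distribution given $\Theta$ that $X$ had before it was observed:

$$\begin{aligned}
\Pr(\Theta \in \Omega_\theta \mid X = (h,y)) &= c_h \int I_{\Omega_\theta}(\psi)\, f_{H|Y,\Theta}(h \mid y, \psi)\, d\rho(\psi) \\
&= c_h \int I_{\Omega_\theta}(\psi)\, r(\psi^{-1}h, y)\, d\rho(\psi) \big/ f_Y(y) \\
&= c_h \int I_{\Omega_\theta}(\psi^{-1})\, r(\psi h, y)\, d\lambda(\psi) \big/ f_Y(y) \\
&= \int I_{\Omega_\theta}(h\psi^{-1})\, r(\psi, y)\, d\lambda(\psi) \big/ f_Y(y) \\
&= \int I_{\Omega_{h^{-1}\theta}^{-1}}(\psi)\, r(\psi, y)\, d\lambda(\psi) \big/ f_Y(y) \\
&= \int I_{\Omega_{h^{-1}\theta}^{-1}}(g)\, f_{H|Y,\Theta}(g \mid y, e)\, d\lambda(g) = P'_e(H' \in \Omega_{h^{-1}\theta}^{-1} \mid Y' = y) \\
&= Q(\theta^{-1}h, y),
\end{aligned}$$

where the first equality follows from Lemma 6.65, the second equality follows from Corollary 6.64, the third follows from Proposition 6.53, the fourth follows from Lemma 6.55, the fifth is just algebra, and the sixth follows from Corollary 6.64. □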

Example 6.77. Let $X_1, \ldots, X_n$ be conditionally IID with $N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. The group is location and scale (Group 3). Consider the hypotheses $\Omega_\theta = \{(a,b) \in \Omega : a \le \mu\}$ for $\theta = (\mu, \sigma)$. The corresponding tests are the usual one-sided $t$-tests. (The reader should check that the conditions of Theorem 6.74 are satisfied.) The associated P-values equal the posterior probabilities that the hypotheses are true if the prior is RHM, the measure with Radon-Nikodym derivative $1/\sigma$ with respect to Lebesgue measure.

As a less familiar example, let

$$\Omega_\theta = \{(a,b) \in \Omega : a \ge \mu,\ b \le \sigma\}.$$

This is a simultaneous test of $H_{\mu,\sigma} : M \ge \mu$ and $\Sigma \le \sigma$. We will check condition 4 only. Since $e = (0,1)$, $h = (m,s) \in \Omega_e$ satisfies $s \le 1$ and $m \ge 0$. For such $h$,

$$\Omega_\theta h = \left\{(a,b) \in \Omega : b \le \sigma s,\ a \ge \mu + \frac{bm}{s}\right\} \subseteq \Omega_\theta,$$

since $\sigma s \le \sigma$ and $\mu + bm/s \ge \mu$. Suppose that data $(\bar{x}_n, s_n)$ are observed with $\bar{x}_n = \sum_{i=1}^n x_i/n$ and $s_n = \sqrt{\sum_{i=1}^n (x_i - \bar{x}_n)^2/(n-1)}$. The test $\phi_{\theta,g}$ rejects $H_{\mu,\sigma}$ if $\bar{x}_n \le \mu + s_n g_1/g_2$ and $s_n \ge \sigma g_2$. The P-value is the size of the test $\phi_{\theta,([\bar{x}_n - \mu]/\sigma,\ s_n/\sigma)}$.
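The agreement between the one-sided $t$-test P-value and the posterior probability under the $1/\sigma$ prior can be checked numerically. The sketch below is an illustration, not part of the example: it computes the P-value for $H : M \le \mu_0$ by integrating the $t_{n-1}$ density, and estimates $\Pr(M \le \mu_0 \mid \text{data})$ by sampling from the posterior, using the standard facts (assumed here) that $\sigma^2 \mid \text{data}$ is distributed as $(n-1)s_n^2/\chi^2_{n-1}$ and $\mu \mid \sigma, \text{data}$ is $N(\bar{x}_n, \sigma^2/n)$. All function names, the seed, and the number of draws are hypothetical.

```python
import math
import random


def t_cdf(t, df, steps=20000):
    """CDF of Student's t via midpoint integration of the density on [0, |t|]."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    width = abs(t) / steps
    area = width * sum(
        const * (1 + ((i + 0.5) * width) ** 2 / df) ** (-(df + 1) / 2)
        for i in range(steps)
    )
    return 0.5 + math.copysign(area, t)


def p_value(xbar, s, n, mu0):
    """One-sided t-test P-value for H: M <= mu0 (large xbar is evidence against H)."""
    t_obs = (xbar - mu0) / (s / math.sqrt(n))
    return 1.0 - t_cdf(t_obs, n - 1)


def posterior_prob(xbar, s, n, mu0, draws=200000, seed=1):
    """Monte Carlo estimate of Pr(M <= mu0 | data) under the prior density 1/sigma."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # chi-squared with k degrees of freedom is Gamma(shape=k/2, scale=2)
        sigma2 = (n - 1) * s * s / rng.gammavariate((n - 1) / 2, 2.0)
        mu = rng.gauss(xbar, math.sqrt(sigma2 / n))
        hits += mu <= mu0
    return hits / draws


print(p_value(1.2, 2.0, 10, 0.5), posterior_prob(1.2, 2.0, 10, 0.5))
```

The two printed numbers should agree up to Monte Carlo error; the simulation is used only so that the posterior side is computed independently of the $t$ distribution.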

6.3.2 Equivariant Confidence Sets 

In Section 5.2.1, we introduced confidence sets as an alternative to testing 
a single hypothesis about a parameter. In Example 5.57 on page 319, we 
saw that the confidence coefficient may not adequately express our degree 
of confidence that the parameter is in the set after seeing the data. That 
example is one in which the distributions are invariant under the action of 
the location group on the real numbers and the group is isomorphic to the 
parameter space. In addition, the sufficient statistic $(T_1, T_2)$ can be transformed to $(T_1, T_2 - T_1)$ so that the group acts on $T_1$ and leaves $T_2 - T_1$ invariant. This is the same situation that arose in Theorems 6.59 and 6.74. In Theorem 6.74, we saw that posterior probabilities agreed with P-values conditional on the ancillary (invariant). A similar thing happens in Example 5.57, namely posterior probabilities (with respect to an improper prior) agree with conditional confidence coefficients. This is a special case of another theorem with conditions similar to the other two. This theorem is similar to one proved by Stein (1965). Chang and Villegas (1986)
prove a similar theorem with slightly different conditions. Berger (1985, 
Section 6.6.3) also proves this theorem in a different way. Jaynes (1976, 
p. 181) gives a proof for the case of a location parameter. 

Theorem 6.78. Assume Assumption 6.58 (see page 368). For each $x \in \mathcal{X}$, let $B_x$ be a measurable subset of $\Omega$ satisfying $B_{gx} = \bar{g}B_x$ for all $g \in G$. Let $C_\theta = \{x : \theta \in B_x\}$. Suppose that we use $\rho$ as a (possibly improper) prior for $\Theta$. Then, for all $x \in \mathcal{X}$ and all $\theta \in \Omega$,

$$\Pr(\Theta \in B_x \mid X = x) = P'_\theta(X \in C_\theta \mid Y = y). \tag{6.79}$$

Proof. As in the proofs of Theorems 6.59 and 6.74, we will assume that $X = t(X)$ and $G = \bar{G} = \Omega$ for ease of notation. Hence, we will write $x = (h,y)$. Now, write $B_{(h,y)} = hB_{(e,y)}$, and use Corollary 6.67 to say that

$$P'_\theta(\Theta^{-1}H \in B_{(e,y)}^{-1} \mid Y = y) = \Pr(\Theta^{-1}H \in B_{(e,y)}^{-1} \mid (H,Y) = (h,y)),$$

where $B_{(e,y)}^{-1} = \{g : g^{-1} \in B_{(e,y)}\}$. Since $\theta^{-1}h \in B_{(e,y)}^{-1}$ if and only if $\theta \in B_{(h,y)}$ if and only if $(h,y) \in C_\theta$, the result follows. □

If we think of $S(X) = B_X$ as a confidence set, then the left-hand side of (6.79) is the posterior probability that $\Theta$ is in the confidence set and the right-hand side is the conditional confidence coefficient given the ancillary.
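Equation (6.79) can be checked by hand in a small location problem. The sketch below is an illustration under assumptions that are not taken from the text: two IID observations from the uniform distribution on $(\theta - 1/2, \theta + 1/2)$, $T_1$ and $T_2$ the smaller and larger observation, ancillary $Y = T_2 - T_1$, equivariant set $B_x = [T_1, T_2]$, and Lebesgue measure as the Haar prior. Under these assumptions the posterior of $\Theta$ is uniform on $[t_2 - 1/2, t_1 + 1/2]$, and, given $Y = y$, $T_1 - \theta$ is uniform on $[-1/2, 1/2 - y]$. The function names are hypothetical.

```python
def overlap(lo1, hi1, lo2, hi2):
    """Length of the intersection of the intervals [lo1, hi1] and [lo2, hi2]."""
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))


def posterior_prob(t1, t2):
    """Pr(Theta in [t1, t2] | T1 = t1, T2 = t2) under a flat prior on theta."""
    lo, hi = t2 - 0.5, t1 + 0.5           # posterior support, of length 1 - y
    return overlap(t1, t2, lo, hi) / (hi - lo)


def conditional_coverage(y):
    """P(T1 <= theta <= T2 | Y = y), using T1 - theta ~ Uniform[-1/2, 1/2 - y]."""
    lo, hi = -0.5, 0.5 - y
    # theta lies in [T1, T1 + y] exactly when T1 - theta lies in [-y, 0]
    return overlap(-y, 0.0, lo, hi) / (hi - lo)


for t1, y in [(3.0, 0.3), (-1.0, 0.5), (0.2, 0.8)]:
    print(y, posterior_prob(t1, t1 + y), conditional_coverage(y))
```

In this assumed example both sides equal $y/(1-y)$ when $y \le 1/2$ and equal 1 when $y > 1/2$, whereas the unconditional confidence coefficient of $[T_1, T_2]$ is $1/2$ regardless of the observed range.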

At this point, we should examine the connection between Theorems 6.74 and 6.78. Since confidence sets and tests are equivalent, one would expect there to be some sort of equivalence between these two theorems. The problem is that Theorem 6.78 applies to all equivariant confidence sets. All such collections of confidence sets correspond to collections of tests. All such tests satisfy the "equivariance" condition $\psi_\theta(h,y) = \psi_{g\theta}(gh, y)$ if $\psi_\theta$ means the corresponding test of $\Omega_\theta$. Furthermore, every such collection of tests leads to a P-value. Each such P-value will be the posterior probability of some set in the parameter space. That set may not equal $\Omega_\theta$, however. Here is how it works. For each $\alpha$, suppose that we choose our $B_{x,\alpha}$ so that $C_{\theta,\alpha} = \{x : \theta \in B_{x,\alpha}\}$ satisfies $P'_\theta(X \in C_{\theta,\alpha} \mid Y = y) = 1 - \alpha$. This makes $B_{x,\alpha}$ a conditional coefficient $1 - \alpha$ confidence set given $Y$. Now define tests $\psi_{\theta,\alpha}$ to be 1 minus the indicator functions of the sets $C_{\theta,\alpha}$. Then the power function of $\psi_{\theta,\alpha}$ satisfies $\beta_{\psi_{\theta,\alpha}}(\theta) = \alpha$, and $\psi_{\theta,\alpha}$ is a level $\alpha$ test of the hypothesis

$$\Omega_\theta = \{\theta' \in \Omega : \beta_{\psi_{\theta,\alpha}}(\theta') \le \alpha \text{ for all } \alpha\}.$$

The conditional P-value relative to the set of tests $\{\psi_{\theta,\alpha} : \alpha \in [0,1]\}$ is

$$p(x) = \inf\{\alpha : x \in C_{\theta,\alpha}^C\}.$$

It follows that $P'_\theta(X \in C_{\theta,p(x)}^C \mid Y = y) = p(x)$. From (6.79), we conclude that

$$\Pr(\Theta \in B_{x,p(x)}^C \mid X = x) = p(x).$$

In general, it might happen that $B_{x,p(x)}^C \ne \Omega_\theta$. Here is an example.

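Example 6.80. Let $X_1, \ldots, X_n$ be conditionally IID with $N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. Here $(\bar{X}_n, S_n)$ is a complete sufficient statistic, where $\bar{X}_n = \sum_{i=1}^n X_i/n$ and $S_n = \sqrt{\sum_{i=1}^n (X_i - \bar{X}_n)^2/(n-1)}$. So, we will ignore the $Y$ part of the problem, since $Y$ is independent of the complete sufficient statistic. Let

$$B_{x,\alpha} = \left[\bar{X}_n - \frac{1}{\sqrt{n}} S_n T_{n-1}^{-1}\left(1 - \frac{\alpha}{2}\right),\ \bar{X}_n + \frac{1}{\sqrt{n}} S_n T_{n-1}^{-1}\left(1 - \frac{\alpha}{2}\right)\right],$$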



where T~\ is the inverse of the CDF of the t n -i(0, 1) distribution. Then, is 
the usual two-sided size a t-test of # : 6 = 0. The P- value is the a value p Q (x) 
such that one of the endpoints of the interval B x ,a equals 0. This makes_B x , Pe ( x ) 
equal to the interval centered at X n and having half- width equal to \X n - 0\. 
On the other hand, Q e = The P-value is the posterior probability of some 
hypothesis, but not the hypothesis you thought you were testing. 



6.3.3 Invariant Tests* 

In multiple parameter problems with hypotheses concerning several pa- 
rameters at once, there may be many competing tests, none of which is 



19 See Problem 24 on page 392.

*This section may be skipped without interrupting the flow of ideas.






UMPU. Just as we used equivariance to reduce the collection of estimators to consider, we can try to reduce the number of tests to consider also. In hypothesis testing, the action space is $\aleph = \{0, 1\}$. There are only two groups that act on this set. One contains only an identity, while the other contains an identity and a "switch" operator, $\tilde{g}(i) = 1 - i$. If we were to construct groups such that $\tilde{G}$ were this second group, then there would have to be conditions under which we were willing to switch the hypothesis and the alternative. Due to the asymmetric treatment of hypotheses and alternatives in classical testing theory, this would not be advisable. Hence, we will only discuss cases in which $\tilde{G}$ consists of one element, namely an identity. Then a decision rule is equivariant if and only if it is invariant. That is, since tests are randomized rules,

$$\delta^*(gx)(\tilde{g}A) = \delta^*(gx)(A) = \delta^*(x)(A),$$

making $\delta^*(gx)(\cdot)$ the same probability as $\delta^*(x)(\cdot)$. So, each equivariant (invariant) test must be a function of the maximal invariant.

Example 6.81. Consider Group 3, namely one-dimensional location-scale. The maximal invariant is

$$\left(\frac{x_1 - \bar{x}}{\sqrt{w}}, \ldots, \frac{x_n - \bar{x}}{\sqrt{w}}\right),$$

where $w = \sum_{i=1}^n (x_i - \bar{x})^2$. Nobody would ever base a test on this alone, because it is ancillary in location-scale problems. In the normal distribution case, it is not even a function of the sufficient statistic.

If we first consider the sufficient statistic (X, W), we see that the maximal 
invariant is constant, hence only constant functions of the sufficient statistic are 
invariant. 

This example raises the question of whether reduction of the set of tests 
by invariance is compatible with reduction by sufficiency. That is, suppose 
that we first reduce to the set of invariant tests and then find a sufficient 
statistic for the maximal invariant parameter and further reduce by con- 
sidering only invariant tests that are a function of the sufficient function of 
the maximal invariant. Will we get the same tests as we would if we first 
reduced to only those tests that depend on the sufficient statistic and then 
reduced to only those that depend on the maximal invariant in the space 
of sufficient statistics? In Example 6.81 on page 381, the answer is yes, but 
only because both methods produce degenerate results. Hall, Wijsman, and 
Ghosh (1965) find conditions for this compatibility. 

The following assumption is an obvious preliminary. It requires that the 
group operation is inherited by the sufficient statistic space. 

Assumption 6.82. If $T(X)$ is sufficient and, for each $g \in G$, we define $T_g(x) = T(gx)$, then $T_g(x)$ depends on $x$ only through $T(x)$.

Example 6.83. Suppose that $\mathcal{X} = \mathbb{R}^n$ and $T(x) = (\bar{x}, w)$, where $w = \sum_{i=1}^n (x_i - \bar{x})^2$. Then $g_{a,b}x = (\ldots, bx_i + a, \ldots)$ and $T(g_{a,b}x) = (b\bar{x} + a, b^2 w)$. This function satisfies Assumption 6.82, assuming it is sufficient. A function that does not satisfy the assumption is $H(x) = x_1 x_2$.
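Both claims in Example 6.83 are easy to check numerically. The sketch below is an illustration with hypothetical function names: it verifies $T(g_{a,b}x) = (b\bar{x} + a, b^2 w)$ on one data vector, and exhibits two points with the same value of $H(x) = x_1 x_2$ whose images under the same transformation have different values of $H$, so that $H(gx)$ is not a function of $H(x)$.

```python
def T(x):
    """T(x) = (sample mean, sum of squared deviations from the mean)."""
    n = len(x)
    xbar = sum(x) / n
    return xbar, sum((xi - xbar) ** 2 for xi in x)


def g(a, b, x):
    """Location-scale transformation g_{a,b}: x_i -> b * x_i + a."""
    return [b * xi + a for xi in x]


def H(x):
    """A statistic that is not compatible with the group action."""
    return x[0] * x[1]


x = [1.0, 4.0, 7.0]
a, b = 2.0, 3.0
xbar, w = T(x)
print(T(g(a, b, x)), (b * xbar + a, b * b * w))   # the two pairs agree

u, v = [1.0, 6.0], [2.0, 3.0]                     # H(u) == H(v) == 6
print(H(g(1.0, 1.0, u)), H(g(1.0, 1.0, v)))       # 14.0 and 12.0 differ
```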

If Assumption 6.82 is satisfied, define $g^*$ to be the transformation on $\mathcal{T}$ given by $g^*t = T_g(x)$ for any $x$ such that $T(x) = t$. The set $G^*$ of all such transformations is a group. Let $U : \mathcal{T} \to \mathcal{U}$ be the maximal invariant in the sufficient statistic space, and let $V : \mathcal{X} \to \mathcal{V}$ be the maximal invariant in the original data space. Then

$$U(T(gx)) = U(T_g(x)) = U(g^*t) = U(t) = U(T(x)).$$

So, $U(T(\cdot))$ is an invariant function in the original data space; hence it is a function of $V$. That is, there exists $H : \mathcal{V} \to \mathcal{U}$ such that $U(T(x)) = H(V(x))$ for all $x \in \mathcal{X}$.

Theorem 6.84 (Stein Theorem).^20 Let $T$ be sufficient and satisfy Assumption 6.82. Suppose that $T$ has discrete distribution. Let $U$ and $V$ be maximal invariants in $\mathcal{T}$ and $\mathcal{X}$, respectively. Let $R(\Theta)$ be the maximal invariant in $\Omega$. Then $U(T(X))$ is sufficient for $R(\Theta)$.

Proof. The proof proceeds through a series of claims. 

(a) $A = V^{-1}(B)$ for some $B \subseteq \mathcal{V}$ if and only if $gA = A$ for all $g$.

(Proof of a): Let $A = V^{-1}(B)$. Then

$$gA = \{gx : V(x) \in B\} = \{x : V(g^{-1}x) \in B\} = \{x : V(x) \in B\} = A,$$

since $V$ is invariant. Now, let $gA = A$ for all $g$. Then

$$I_A(gx) = I_{g^{-1}A}(x) = I_A(x),$$

so $I_A(\cdot)$ is invariant and it must be a function of the maximal invariant, namely $I_A(x) = f(V(x))$. Let $B = f^{-1}(\{1\})$. Then $I_A(x) = I_B(V(x))$, so $A = \{x : V(x) \in B\}$. A set $A$ that satisfies the conditions of this claim is called an invariant set.

(b) $\Pr(X \in A \mid T = t)$ is an invariant function of $t$ if $A$ satisfies $gA = A$ for all $g \in G$.

(Proof of b): Choose any $\theta$ and $t$ such that $\Pr(T(X) = t \mid \Theta = \theta) > 0$. Let $g \in G$.

$$\begin{aligned}
\Pr(X \in A \mid T(X) = t) &= \frac{\Pr(X \in A,\ T(X) = t \mid \Theta = \theta)}{\Pr(T(X) = t \mid \Theta = \theta)} \\
&= \frac{\Pr(X \in gA,\ T(X) = g^*t \mid \Theta = \bar{g}\theta)}{\Pr(T(X) = g^*t \mid \Theta = \bar{g}\theta)} \\
&= \Pr(X \in gA \mid T(X) = g^*t) \\
&= \Pr(X \in A \mid T(X) = g^*t),
\end{aligned}$$

20 Hall, Wijsman, and Ghosh (1965) attribute this theorem to Stein. 






where the second-to-last equality holds by sufficiency of $T$.

(c) If $A$ is an invariant set, then $P'_\theta(A \mid U(T(X)) = u)$ is constant in $\theta$ for each $u$.

(Proof of c): Write $P'_\theta(A \mid U(T(X)) = u)$ as

$$\sum_{t \in U^{-1}(u)} \Pr(X \in A \mid T(X) = t)\, \Pr(T(X) = t \mid U(T(X)) = u,\ \Theta = \theta).$$

Since $U$ is maximal invariant, $U^{-1}(u)$ is an orbit in $\mathcal{T}$. So, $U^{-1}(u) = \{t : t = g^*t_u \text{ for some } g^* \in G^*\}$ for some $t_u \in \mathcal{T}$. It follows that $P'_\theta(A \mid U = u)$ equals

$$\sum_{g^* \in G^*} \Pr(T(X) = g^*t_u \mid U = u,\ \Theta = \theta)\, \Pr(X \in A \mid T(X) = g^*t_u).$$

The last factor equals $\Pr(X \in A \mid T = t_u)$ by (b), so it factors out of the sum. Also,

$$\{x : U(T(x)) = u\} = \bigcup_{g^* \in G^*} \{x : T(x) = g^*t_u\},$$

so the remaining sum equals 1 and

$$P'_\theta(A \mid U = u) = \Pr(X \in A \mid T(X) = t_u),$$

which is the same for all $\theta$.

(d) Part (a) says that the invariant sets constitute the $\sigma$-field generated by $V$. Part (c) says that for each $A$ in that $\sigma$-field, $P'_\theta(X \in A \mid U = u)$ is constant in $\theta$. Hence $U(T(X))$ is sufficient. □

Hall, Wijsman, and Ghosh discuss conditions under which the Stein the- 
orem 6.84 holds for continuous distributions. 

Example 6.85. Consider Group 3 again. The maximal invariant is

$$V(X) = \left(\frac{X_1 - X_n}{X_{n-1} - X_n}, \ldots, \frac{X_{n-2} - X_n}{X_{n-1} - X_n},\ \mathrm{sign}(X_{n-1} - X_n)\right),$$

which is independent of $\Theta$. So the sufficient statistic in the maximal invariant space is constant. The maximal invariant in the sufficient statistic space is also constant. Since there is only one orbit in the parameter space, the maximal invariant in the parameter space is also constant. In simple English, Group 3 equivariance is useless in hypothesis testing.

Definition 6.86. A function $f$ on $\mathcal{X}$ is almost invariant with respect to $\mu$ if for each $g \in G$, there exists $B_g \in \mathcal{B}$ such that $\mu(B_g) = 0$ and $f(x) = f(gx)$ for all $x \notin B_g$.

Proposition 6.87. If $P_\theta \ll \mu$ for each $\theta$, if $v(\theta)$ is maximal invariant in $\Omega$, and if $f$ is almost invariant with respect to $\mu$, then the distribution of $f(X)$ given $\Theta = \theta$ depends on $\theta$ only through $v(\theta)$.






The proof of this is very similar to the proof of part 4 of Lemma 6.39. 

Definition 6.88. A test $\phi$ is UMPU almost invariant (UMPUAI) level $\alpha$ if it is UMPU among all almost invariant level $\alpha$ tests.

Theorem 6.89. Suppose that $P_\theta \ll \mu$ for each $\theta$ and a hypothesis-testing problem is invariant under $G$ and $\bar{G}$. Suppose that there exists $\phi^*$, which is UMPU level $\alpha$ and $\phi^*$ is unique a.e. $[\mu]$. Suppose also that there exists $\phi_0$, which is UMPUAI level $\alpha$. Then $\phi_0$ is also unique a.e. $[\mu]$ and $\phi_0 = \phi^*$ a.e. $[\mu]$.

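Proof. Let $U_\alpha$ be the class of all unbiased level $\alpha$ tests. First, we show that $\phi \in U_\alpha$ if and only if $\phi_g \in U_\alpha$ for each $g$, where $\phi_g(x) = \phi(gx)$:

$$E_\theta \phi_g(X) = E_\theta \phi(gX) = E_{\bar{g}\theta} \phi(X),$$

which is less than or equal to or greater than or equal to $\alpha$, respectively, according to $\bar{g}\theta \in \Omega_H$ or $\bar{g}\theta \in \Omega_A$, that is, according to $\theta \in \Omega_H$ or $\theta \in \Omega_A$ by invariance. This makes $\phi_g$ unbiased level $\alpha$.

Next, we show that $\phi^*_g$ is UMP in $U_\alpha$. Since $\phi^* \in U_\alpha$, we have that $\phi^*_g \in U_\alpha$ by the first result. Let $\theta \in \Omega_A$. Then

$$\begin{aligned}
E_\theta \phi^*_g(X) &= E_\theta \phi^*(gX) = E_{\bar{g}\theta} \phi^*(X) \\
&= \sup_{\phi \in U_\alpha} E_{\bar{g}\theta} \phi(X) = \sup_{\phi \in U_\alpha} E_\theta \phi(gX) \\
&= \sup_{\phi \in U_\alpha} E_\theta \phi_g(X) = \sup_{\phi \in U_\alpha} E_\theta \phi(X) = E_\theta \phi^*(X),
\end{aligned}$$

since $\phi^*$ is UMP in $U_\alpha$. So, $\phi^*_g = \phi^*$ a.e. $[\mu]$ for each $g$ by the uniqueness of $\phi^*$. This makes $\phi^*$ almost invariant. Since $\phi_0$ is UMPUAI level $\alpha$, $\beta_{\phi_0}(\theta) \ge \beta_{\phi^*}(\theta)$ for all $\theta \in \Omega_A$, so $\phi_0$ is also UMPU level $\alpha$ and $\phi_0 = \phi^*$ a.e. $[\mu]$ also, and so it is also unique a.e. $[\mu]$. □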
This theorem does not guarantee that the UMPUAI level a test is 
UMPU, but it provides insurance that if there is a unique UMPU level 
a test, we can find it by finding the UMPUAI level a test. 

One-Way Analysis of Variance

Consider the one-way ANOVA (analysis of variance). That is, $Y_{ij}$ are conditionally independent with $Y_{ij} \sim N(\mu_i, \sigma^2)$ given $M_i = \mu_i$ for $j = 1, \ldots, n_i$ and $i = 1, \ldots, k$ and $\Sigma = \sigma$. First, reduce by sufficiency to

$$\bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij} \text{ for } i = 1, \ldots, k, \text{ and } W = \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2.$$






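Suppose that $\Omega_H = \{(\sigma, \mu) : B\Lambda\mu = 0\}$, where $B$ is an $r \times k$ rank $r$ matrix with $r < k$, $\Lambda$ is the diagonal matrix

$$\Lambda = \begin{pmatrix} \sqrt{n_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{n_k} \end{pmatrix},$$

and $0$ is the vector all of whose coordinates are 0. Without loss of generality, we can assume that $B$ is the first $r$ rows of an orthogonal matrix $\Gamma$. Let $\bar{Y}$ be the vector whose $i$th coordinate is $\bar{Y}_i$ for each $i$. Make the one-to-one transformation of the data to $X = \Gamma\Lambda\bar{Y}$, and $W$. Now, given the parameters, $X$ is independent of $W$ with

$$X \sim N_k(\gamma, \sigma^2 I), \qquad \frac{W}{\sigma^2} \sim \chi^2_d,$$

where $d = n - k$ (with $n = \sum_{i=1}^k n_i$) and $\gamma = \Gamma\Lambda\mu$. We can write

$$\Omega_H = \{(\gamma, \sigma) : \gamma_1 = \cdots = \gamma_r = 0\}.$$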

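Let the group $G$ consist of triples $(A, b, c)$, where $A$ is $r \times r$ orthogonal, $b$ is $(k - r)$-dimensional, and $c > 0$. Define

$$g_{A,b,c}(x, w) = \left(c \begin{pmatrix} A x_1 \\ x_2 + b \end{pmatrix},\ c^2 w\right),$$

where we write $x^\top = (x_1^\top, x_2^\top)$ and $x_1$ is $r$-dimensional and $x_2$ is $(k - r)$-dimensional. In the parameter space,

$$\bar{g}_{A,b,c}(\gamma, \sigma) = \left(c \begin{pmatrix} A \gamma_1 \\ \gamma_2 + b \end{pmatrix},\ c\sigma\right),$$

with $\gamma^\top = (\gamma_1^\top, \gamma_2^\top)$ and $\gamma_1$ the first $r$ coordinates, preserves the hypothesis. So the testing problem is invariant.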

The maximal invariant in $\mathcal{X}$ is determined by $f(g(x, w)) = f(x, w)$ for all $g$, $x$, and $w$. So, for fixed $x$ and $w$, let $A$ have first row proportional to $x_1^\top$, $b = -x_2$, and let $c = 1/\sqrt{w}$. Then

$$f(g_{A,b,c}(x, w)) = f\left(\begin{pmatrix} \sqrt{x_1^\top x_1 / w} \\ 0 \\ \vdots \\ 0 \end{pmatrix},\ 1\right) = f(x, w).$$

So $x_1^\top x_1/w$ is maximal invariant, since it is clearly invariant. The usual $F$ statistic for testing $H$ is just $d/r$ times the maximal invariant, and it has noncentral $F$ distribution $NCF(r, d, \delta)$, where $\delta = \sum_{i=1}^r \gamma_i^2/\sigma^2$ conditional on the parameters. The hypothesis $H$ is equivalent to $\delta = 0$. Since the noncentral $F$ distribution has MLR in the noncentrality parameter (see Problem 29 on page 289), the $F$-test is UMPUAI level $\alpha$.
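The invariance of $x_1^\top x_1/w$ under the group action can be checked directly. The sketch below is an illustration with hypothetical function names and made-up numbers: it applies a transformation $g_{A,b,c}$, with $A$ a $2 \times 2$ rotation, to a point $(x_1, x_2, w)$ and compares the statistic before and after.

```python
import math


def act(A, b, c, x1, x2, w):
    """Apply g_{A,b,c}: (x1, x2, w) -> (c * A x1, c * (x2 + b), c^2 * w)."""
    Ax1 = [sum(A[i][j] * x1[j] for j in range(len(x1))) for i in range(len(A))]
    new_x1 = [c * v for v in Ax1]
    new_x2 = [c * (v + bj) for v, bj in zip(x2, b)]
    return new_x1, new_x2, c * c * w


def max_invariant(x1, w):
    """The statistic x1' x1 / w."""
    return sum(v * v for v in x1) / w


def f_statistic(x1, w, r, d):
    """(d / r) times the maximal invariant."""
    return (d / r) * max_invariant(x1, w)


angle = 0.7
A = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle), math.cos(angle)]]      # an orthogonal 2 x 2 matrix
x1, x2, w = [3.0, 4.0], [1.5], 10.0
y1, y2, v = act(A, [-2.0], 2.5, x1, x2, w)
print(max_invariant(x1, w), max_invariant(y1, v))   # both equal 2.5 up to rounding
```

The shift $b$ and the block $x_2$ do not enter the statistic at all, the rotation leaves $x_1^\top x_1$ unchanged, and the factor $c^2$ cancels between numerator and denominator.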






Multivariate Analysis of Variance 

We now present an example of a case in which the number of tests available 
is so large that even a reduction by invariance still leaves too many tests to 
consider.^21 Imagine that the data consist of exchangeable $p$-dimensional observations $X_1, \ldots, X_n$. We will write the data matrix as $\mathbf{M} = [\mathbf{M}_1 | \mathbf{M}_2 | \mathbf{M}_3]$, where

$$\mathbf{M}_1 = [X_1 | \cdots | X_q], \quad \mathbf{M}_2 = [X_{q+1} | \cdots | X_k], \quad \mathbf{M}_3 = [X_{k+1} | \cdots | X_n],$$

where $n > k > q$. The parameter is $\Theta = (M_1, \ldots, M_k, \Sigma)$, where each $M_i$ is a $p$-dimensional vector and $\Sigma$ is a $p \times p$ positive definite matrix. The conditional distribution of the $X_i$ given $\Theta = (\sigma, \mu_1, \ldots, \mu_k)$ is that the $X_i$ are independent with $X_i$ having $N_p(\mu_i, \sigma)$ distribution for $i \le k$ and $X_i$ having $N_p(0, \sigma)$ distribution for $i > k$. The hypothesis of interest is $M_1 = \cdots = M_q = 0$.

The group we choose for this problem comes in four parts: 

^1 = {9a A '\s & p x k - q matrix}, 

Q 2 = [g 2 D : D is n-kxn-k orthogonal}, 

£3 = {g% : C is q x q orthogonal}, 

^4 == {.9e '• E is p x p nonsingular}. 

These groups are applied in sequence as follows: 

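$$ g_{A,D,C,E} M = [E M_1 C \,|\, M_2 + A \,|\, E M_3 D]. $$

The action on the parameter is $\bar{g}_{A,D,C,E}\Theta$ equal to

$$ (E \Sigma E^\top,\; E[\mathrm{M}_1|\cdots|\mathrm{M}_q]C,\; [\mathrm{M}_{q+1}|\cdots|\mathrm{M}_k] + A,\; E[\mathrm{M}_{k+1}|\cdots|\mathrm{M}_n]D). $$

Note that the hypothesis is not altered by action of $g$. That is,

$$ \mathrm{M}_1 = \cdots = \mathrm{M}_q = 0 \text{ if and only if } E[\mathrm{M}_1|\cdots|\mathrm{M}_q]C = [0|\cdots|0]. $$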

To find the maximal invariant, we set 

$$ f(M) = f(g_{A,D,C,E} M), \text{ for all } A, D, C, E, M. $$

In particular, suppose that $A = -M_2$; then $f(M) = f([E M_1 C \,|\, O \,|\, E M_3 D])$,
where $O$ is a matrix of all zeros. Now, consider the following lemma.

Lemma 6.90. Two $a \times b$ matrices $R$ and $T$ satisfy $RR^\top = TT^\top$ if and
only if $T = RQ$ for some $b \times b$ orthogonal matrix $Q$.

21 For a good introduction to invariant tests in multivariate problems, see
Anderson (1984, Chapter 8) or Kshirsagar (1972, Chapters 7-10).






Proof. First, suppose that $T = RQ$; then $TT^\top = (RQ)(RQ)^\top = RR^\top$.
Next, suppose $RR^\top = TT^\top$. Write the singular-value decompositions of $R$
and $T$ as

$$ R = \Gamma_R \Lambda_R \Omega_R, \qquad T = \Gamma_T \Lambda_T \Omega_T, $$

where $\Gamma_R$, $\Gamma_T$, $\Omega_R$, and $\Omega_T$ are orthogonal and $\Lambda_R$ and $\Lambda_T$ are "diagonal"
matrices arranged so that the absolute values of the diagonal entries
increase as you read down the diagonal. (The $\Lambda$ matrices are not really
diagonal because they are not square. Their only nonzero entries are (1,1),
(2,2), etc., however.) Then

$$ RR^\top = \Gamma_R \Lambda_R \Lambda_R^\top \Gamma_R^\top = \Gamma_T \Lambda_T \Lambda_T^\top \Gamma_T^\top = TT^\top. $$

Since these are two representations of the eigenvalue decomposition of the
same matrix, it follows that $\Gamma_T = \Gamma_R$ and $\Lambda_T = \Lambda_R J$, where $J$ is a diagonal
matrix with only $\pm 1$ in each diagonal entry. (If $RR^\top$ has eigenvalues with
nonunit multiplicity, a permutation of the columns of $\Gamma_T$ may be required
to make it equal to $\Gamma_R$.) So, $T = R\,\Omega_R^\top J \Omega_T$. Since $\Omega_R^\top J \Omega_T$ is orthogonal,
it follows that $T = RQ$, where $Q$ is orthogonal. □
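
Lemma 6.90 can be checked numerically. The sketch below is only an illustration and does not follow the SVD argument of the proof: it builds $T = RQ_0$ from a random orthogonal $Q_0$ and then recovers an orthogonal $Q$ with $RQ = T$ by the orthogonal Procrustes solution (the polar factor of $R^\top T$). The helper name `orthogonal_link`, the dimensions, and the random seed are assumptions made for this example.

```python
import numpy as np

def orthogonal_link(R, T):
    """Return an orthogonal Q minimizing ||R Q - T||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(R.T @ T)   # full SVD of the b x b matrix R^T T
    return U @ Vt

rng = np.random.default_rng(0)
a, b = 3, 5                                          # hypothetical dimensions
R = rng.standard_normal((a, b))
Q0, _ = np.linalg.qr(rng.standard_normal((b, b)))    # a random b x b orthogonal matrix
T = R @ Q0                                           # so R R^T = T T^T by construction

Q = orthogonal_link(R, T)
gram_gap = np.max(np.abs(R @ R.T - T @ T.T))         # checks R R^T = T T^T
fit_gap = np.max(np.abs(R @ Q - T))                  # checks T = R Q
orth_gap = np.max(np.abs(Q @ Q.T - np.eye(b)))       # checks Q is orthogonal
```

The recovered $Q$ need not equal $Q_0$ when $b > a$, which is consistent with the lemma: it asserts only that some orthogonal $Q$ exists.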

Now, let $M_1^*$ be a $p \times q$ matrix such that $M_1 M_1^\top = M_1^* M_1^{*\top}$, and let
$C$ be orthogonal such that $M_1^* = M_1 C$. It follows that $f([M_1|M_2|M_3]) =
f([M_1^*|M_2|M_3])$. Similarly, if $M_3^*$ is such that $M_3 M_3^\top = M_3^* M_3^{*\top}$, then
$f([M_1|M_2|M_3]) = f([M_1|M_2|M_3^*])$. It follows that $f$ is a function of $M_1$
and $M_3$ through $M_1 M_1^\top$ and $M_3 M_3^\top$ only. Define $g(B, W) = f([M_1|O|M_3])$,
where $B = M_1 M_1^\top$ and $W = M_3 M_3^\top$. It follows that $f(M) = g(B, W)$ if $f$
is invariant. Also, $f(g_{A,C,D,E} M) = g(EBE^\top, EWE^\top)$. Finally, write the
eigenvalue decomposition of

$$ W^{-1/2} B W^{-1/2} = \Gamma\, \mathrm{diag}(\lambda_1, \ldots, \lambda_s, 0, \ldots, 0)\, \Gamma^\top = \Gamma \Lambda \Gamma^\top, $$

where $s$ is the rank of $B$. Note that $s = \min\{p, q\}$ with probability 1. Then
set $E = \Gamma^\top W^{-1/2}$. It follows that $f(M) = g(EBE^\top, EWE^\top) = g(\Lambda, I)$.
Note that $\Lambda$ is invariant; hence it is maximal invariant.

What we have just proven is that every invariant test in MANOVA must
be a function of the nonzero eigenvalues of $W^{-1/2} B W^{-1/2}$, which are the
same as the nonzero eigenvalues of $W^{-1} B$. A similar argument shows that
the maximal invariant in the parameter space is the set of nonzero
eigenvalues of $\Sigma^{-1}\mathrm{M}$, where $\mathrm{M} = [\mathrm{M}_1|\cdots|\mathrm{M}_q][\mathrm{M}_1|\cdots|\mathrm{M}_q]^\top$. The two special
cases in which $s = 1$ are of interest. If $p = 1$, we have univariate ANOVA
and the only nonzero eigenvalue of $W^{-1} B$ is $qF/(n - k)$, where $F$ is the
$F$ statistic for testing $H$. If $q = 1$, then the only eigenvalue of $W^{-1} B$ is
Hotelling's $T^2$. For cases in which $s > 1$, there is no UMPUAI test, but
there are several well-known invariant tests based on the eigenvalues of
$W^{-1} B$. One is based on the largest eigenvalue, another on the sum of the
eigenvalues, and a third on the product of the nonzero eigenvalues.
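
As a rough numerical illustration of the maximal invariant, the following sketch computes the eigenvalues of $W^{-1}B$ from simulated blocks $M_1$ and $M_3$, applies one arbitrary group element ($E$ nonsingular, $C$ and $D$ orthogonal), and recomputes them. The function name `manova_eigenvalues`, the dimensions, and the simulated data are assumptions for this example; the eigenvalues are obtained through a Cholesky factor of $W$, which gives a symmetric matrix similar to $W^{-1}B$.

```python
import numpy as np

def manova_eigenvalues(M1, M3):
    """Eigenvalues of W^{-1} B with B = M1 M1^T and W = M3 M3^T, in decreasing order."""
    W = M3 @ M3.T
    L = np.linalg.cholesky(W)                 # W = L L^T
    Z = np.linalg.solve(L, M1)                # Z Z^T = L^{-1} B L^{-T}, similar to W^{-1} B
    return np.sort(np.linalg.eigvalsh(Z @ Z.T))[::-1]

rng = np.random.default_rng(1)
p, q, m = 3, 2, 8                             # m plays the role of n - k (hypothetical sizes)
M1 = rng.standard_normal((p, q)) + 1.0
M3 = rng.standard_normal((p, m))

lam = manova_eigenvalues(M1, M3)

# Apply one group element: E nonsingular, C and D orthogonal.
E = np.eye(p) + 0.3 * rng.standard_normal((p, p))
C, _ = np.linalg.qr(rng.standard_normal((q, q)))
D, _ = np.linalg.qr(rng.standard_normal((m, m)))
lam_moved = manova_eigenvalues(E @ M1 @ C, E @ M3 @ D)

largest_root = lam[0]                         # statistic based on the largest eigenvalue
trace_stat = lam.sum()                        # statistic based on the sum of the eigenvalues
product_stat = np.prod(lam[:min(p, q)])       # product of the s = min(p, q) nonzero eigenvalues
```

Here `lam` and `lam_moved` agree up to rounding, and the last entry of `lam` is numerically zero because $B$ has rank $\min\{p, q\} = 2$.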

A Test Based on Tolerance Sets 

Let $\Theta = (\mathrm{M}, \Sigma)$, and suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID with
$N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. Let $X = (X_1, \ldots, X_n)$, and let
$V = \sum_{i=n+1}^{n+m} X_i/m$. Suppose that we want to try to develop a test of the
hypothesis $H : V \le c$. First, we convert the hypothesis into a parametric
hypothesis as in (3.15). For each $\delta \in (0, 1)$, let

$$ \Omega_\delta = \{\theta = (\mu, \sigma) : P'_\theta(V \le c) \ge \delta\}. $$

We might wish to choose values of $\delta$ and $\alpha$ and then require that for all
$\theta \in \Omega_\delta$, $P'_\theta(\text{reject } H) \le \alpha$. This means that we are trying to test
$H' : \Theta \in \Omega_\delta$ at level $\alpha$. We will use a version of the group described in
Problem 11 on page 389. An element $g_a$ of the group acts on $\mathcal{X}$ by
$g_a(x_1, \ldots, x_n) = (c + a(x_1 - c), \ldots, c + a(x_n - c))$. The maximal invariant in
the sufficient statistic space is $T = \sqrt{n}(\bar{X} - c)/S$.$^{22}$ The maximal invariant in
the parameter space is $B = (\mathrm{M} - c)/\Sigma$. We know that $V \le c$ if and only if
$\sqrt{m}(V - \mathrm{M})/\Sigma \le -\sqrt{m}B$, and so $P'_\theta(V \le c) \ge \delta$ if and only if
$\beta \le \Phi^{-1}(1 - \delta)/\sqrt{m}$. So, $\Omega_\delta = \{\theta : \beta \le \beta_0\}$, where $\beta = (\mu - c)/\sigma$ and
$\beta_0 = \Phi^{-1}(1 - \delta)/\sqrt{m}$. So the test we seek is equivalent to $H : B \le \beta_0$. The
conditional distribution of $T$ given $B = \beta$ is noncentral $t$, $NCt_{n-1}(\sqrt{n}\beta)$.
This distribution has increasing MLR in the noncentrality parameter (see
Problem 29 on page 289). The UMP invariant level $\alpha$ test is to reject $H$ if $T$
is greater than the $1 - \alpha$ quantile of the $NCt_{n-1}(\sqrt{n}\beta_0)$ distribution. Let
this quantile be denoted $d$. Then $T > d$ is equivalent to
$c \notin [\bar{X} - dS/\sqrt{n}, \infty)$, which in turn is
equivalent to the test found in Example 5.73 on page 326.
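
The cutoff $d$ can be computed numerically. The sketch below assumes SciPy is available and uses its `scipy.stats.nct` and `scipy.stats.norm` distributions; the function name `tolerance_test_cutoff` and the values of $n$, $m$, $\delta$, and $\alpha$ are illustrative assumptions, not values from the text.

```python
import math
from scipy.stats import nct, norm

def tolerance_test_cutoff(n, m, delta, alpha):
    """Cutoff d: reject H when T = sqrt(n) * (xbar - c) / S exceeds d."""
    beta0 = norm.ppf(1.0 - delta) / math.sqrt(m)   # boundary value beta_0
    nc = math.sqrt(n) * beta0                      # noncentrality sqrt(n) * beta_0
    d = nct.ppf(1.0 - alpha, df=n - 1, nc=nc)      # 1 - alpha quantile of NCt_{n-1}(nc)
    return d, nc

n, m, delta, alpha = 10, 4, 0.2, 0.05              # illustrative values only
d, nc = tolerance_test_cutoff(n, m, delta, alpha)
size_at_boundary = nct.sf(d, df=n - 1, nc=nc)      # rejection probability when beta = beta_0
```

By construction `size_at_boundary` should be close to `alpha`; for $\beta < \beta_0$ the rejection probability is smaller because the family has increasing MLR in the noncentrality parameter.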

6.4 Problems 

Section 6.1.1: 

1. Prove Proposition 6.3 on page 346. 

2. Prove Proposition 6.5 on page 346. 

3. Let $\Theta$ be a location parameter for $X$, let $\aleph = \mathbb{R}$, and suppose that $L(\theta, a)$
is a function of $\theta - a$. Prove that the risk function of a location equivariant
rule $\delta$ is constant.



22 The reader might wish to prove this in solving Problem 11 on page 389.



4. If $\Theta$ is a location parameter and $Y = g(X)$ is location invariant, then prove
that $Y$ is ancillary.

5. Suppose that $X_1, \ldots, X_n$ are IID given $\Theta = \theta$ each with density



(This is called the half-normal distribution.) Let $L(\theta, a) = (\theta - a)^2$ and
$\aleph = \Omega = \mathbb{R}$. Let $G$ be the one-dimensional location group, $g_c(x_1, \ldots, x_n) =
(x_1 + c, \ldots, x_n + c)$. Find the MRE estimator.

6. A function $g : \mathbb{R}^n \to \mathbb{R}$ is even if $g(-x_1, \ldots, -x_n) = g(x_1, \ldots, x_n)$. A
function $g$ is odd if $g(-x_1, \ldots, -x_n) = -g(x_1, \ldots, x_n)$. Suppose that $S$ is
odd and location equivariant and that $T$ is even and location invariant.
Suppose that $X_1, \ldots, X_n$ are IID with density $f$ with respect to Lebesgue
measure such that $f(c - x) = f(c + x)$ for some $c$ and all $x$. (Such a density
is called symmetric about $c$.) Suppose that the variances of $S(X_1, \ldots, X_n)$
and $T(X_1, \ldots, X_n)$ are both finite. Prove that the covariance between them
is 0.

Section 6.1.2: 

7. For each vector $x = (x_1, \ldots, x_n)$, let $k(x)$ denote the subscript of the
last nonzero coordinate with $k(0, \ldots, 0) = 0$. Let $x_0 = 1$. Prove that a
function $u$ is scale invariant if and only if it is a function of $x$ only through
$y(x) = (x_1/|x_{k(x)}|, \ldots, x_n/|x_{k(x)}|)$.

8. Suppose that $\delta_0$ is scale equivariant and not identically 0. Prove that $\delta_1$ is
scale equivariant if and only if $\delta_1 = u\delta_0$ for some scale invariant $u$.

Section 6.2.1: 

9. Prove Proposition 6.25 on page 354. 

10. Prove Proposition 6.29 on page 355. 

11. *Let $X_1, \ldots, X_n$ be IID $N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$. Let $\aleph = \{0, 1\}$ and

$$ L(\theta, a) = \begin{cases} R & \text{if } \mu \ge \mu_0 \text{ and } a = 1, \\ 1 & \text{if } \mu < \mu_0 \text{ and } a = 0, \\ 0 & \text{otherwise.} \end{cases} $$

(a) Prove that the formal Bayes rule with respect to the improper prior
with Radon-Nikodym derivative $1/\sigma$ with respect to Lebesgue measure
is the usual level $1/(1 + R)$ $t$-test.

(b) Let $G$ be a group that acts on $\mathcal{X}$ as follows:

$$ g_c(x_1, \ldots, x_n) = (c(x_1 - \mu_0) + \mu_0, \ldots, c(x_n - \mu_0) + \mu_0) $$

for $c > 0$. Find $\overline{G}$ and $\tilde{G}$ so that this problem is invariant, and show
that the $t$-test is equivariant.






Section 6.2.2: 

12. Prove Proposition 6.43 on page 361. (Hint: The proof is very much like
Example 6.42. There is no need to transform the data.)

Section 6.2.3: 

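13. *Let $\Omega = (0, \infty)$. Suppose that $X_1, \ldots, X_n$ are IID given $\Theta = \theta$ each with
density

$$ f_{X_1|\Theta}(x|\theta) = \frac{1}{\theta} f\left(\frac{x}{\theta}\right), $$

for some density function $f$. Let $G$ be Group 2 and $\aleph = [0, \infty)$. Let $L(\theta, a) =
(\theta^r - a)^2/\theta^{2r}$, for some $r > 0$.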

(a) Find G so that this problem is invariant. 

(b) Characterize all equivariant rules. 

(c) Write a formula for the MRE rule. 

(d) If $f(x) = I_{[0,1]}(x)$, find the MRE rule.

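14. Let $\Omega = (0, \infty)$. Suppose that $X_1, \ldots, X_n$ are IID given $\Theta = \theta$ each with
density

$$ f_{X_1|\Theta}(x|\theta) = \frac{1}{\theta} f\left(\frac{x}{\theta}\right), $$

for some density function $f$. Let $G$ be Group 2 and $\aleph = [0, \infty)$. Let $L(\theta, a) =
(k\log(\theta) - r\log(a))^2$, for some $k, r > 0$.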

(a) Find $\tilde{G}$ so that this problem is invariant.

(b) Characterize all equivariant rules. 

(c) Write a formula for the MRE rule. 

15. Let $X_1, \ldots, X_n$ be IID $U(0, \theta)$ random variables conditional on $\Theta = \theta$,
and let the action space be $\aleph = [0, \infty)$. Let the loss function be $L(\theta, a) =
(1 - a/\theta)^2$.

(a) Show that this problem is invariant under the one-dimensional scale 
group, Group 2. 

(b) Find the MRE decision rule. 

16. Prove Corollary 6.52 on page 366. 

17. Prove Proposition 6.53 on page 367. 

18. Suppose that $X_1, \ldots, X_n$ are IID $U(\theta_1, \theta_2 + \theta_1)$ given $\Theta = (\theta_1, \theta_2)$, where
$\Omega = \mathbb{R} \times \mathbb{R}^+$. Let $\aleph = \Omega$ and

Show that this problem is invariant under Group 3, and find the MRE 
decision rule. 






19. Let $f : \mathbb{R} \to [0, \infty)$ be a function such that $\int |x| f(x)\,dx < \infty$. Suppose that
$X_1, \ldots, X_n$ are conditionally IID given $\Theta = \theta$ each with density $f(x - \theta)$.
Let the prior density of $\Theta$ be proportional to $f(c - \theta)$. Suppose that the
loss function is $L(\theta, a) = \rho(\theta - a)$ for some function $\rho$. If the formal Bayes
rule exists, show that it is the same as the MRE decision based on a sample
containing one extra observation $X_{n+1} = c$.

20. Let $X_1, \ldots, X_n$ ($n > 2$) be IID with $Exp(1/\theta)$ distribution given $\Theta = \theta$.
Use the one-dimensional scale group, Group 2. Let the action space be the
same as the parameter space, and let the loss be $L(\theta, a) = (\theta^2 + a^2)/(a\theta)$.

(a) Find groups to act on the parameter and action spaces so that the
decision problem is invariant.

(b) Find the best equivariant rule.

21. Suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta = \theta$ each with
conditional density



where $a$ is known and the parameter space is $\Omega = (0, \infty)$. Let Group 2 (the
one-dimensional scale group) act on the data. Let the action space be the
same as the parameter space.

(a) Find groups acting on the parameter and action spaces so that the
decision problem with loss $L(\theta, a) = (\theta - a)^2/\theta^2$ is invariant.

(b) Find the MRE decision rule.

Section 6.3.1:

22. Show that Theorem 6.74 applies, and state the conclusions of the theorem 
in the situation described in Problem 31 on page 289. 

Section 6.3.2: 

23. *Each part of this question assumes the hypotheses of the preceding parts. 

(a) Let $P$ and $Q$ be probability measures on $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel
$\sigma$-field. Suppose that $X = (X_1, \ldots, X_n)$ is an IID sample from a
distribution with probability measure $P$. Let $Y$ be another real-valued
random variable independent of $X$ with distribution $Q$. Let $C =
C(X)$ be a measurable subset of $\mathbb{R}$. Define the content of $C$ by $Q(C)$.
Prove that the expected value of the content equals the probability
that $C(X)$ contains $Y$. You may assume all necessary measurability
conditions.

(b) Let $\Omega$ be a parameter space, and suppose now that $P$ is only known
to be an element of the parametric family $\{P_\theta : \theta \in \Omega\}$ and that $Q$ is
only known to be an element of the parametric family $\{Q_\theta : \theta \in \Omega\}$
(same parameter space). Let $E_\theta$ represent expectation with respect to
the conditional distribution of $X$ given $\Theta = \theta$. Suppose that we wish
to choose $C$ in order to maximize $E_\theta[Q_\theta(C)]$ uniformly in $\theta$ subject
to $E_\theta[P_\theta(C)] \le \beta$ for all $\theta$. Prove that this is equivalent to finding
a uniformly most powerful size $\beta$ critical region for the
hypothesis-testing problem:

$H$: $X_1, \ldots, X_n, Y$ are an IID sample from $P_\theta$ for some $\theta \in \Omega$,
$A$: $X_1, \ldots, X_n$ are an IID sample from $P_\theta$ independent of $Y$
which has distribution $Q_\theta$ for some $\theta \in \Omega$.

(c) Suppose that $\theta = (\mu, \sigma) \in \mathbb{R} \times \mathbb{R}^+$, $P_\theta$ is the normal $N(\mu, \sigma^2)$
distribution, and $Q_\theta$ is the $N(\mu, a\sigma^2)$ distribution for some known
$a \in (0, 1)$. Show that the hypothesis-testing problem from (b) is
invariant under the location-scale group.

(d) Let $S^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2$, and show that $(Y, \bar{X}, S^2)$ is a sufficient
statistic for this problem. Also, find a maximal invariant in the
sufficient statistic space under the action of the location-scale group.

(e) Among all sets $C$ as described in part (a) which are also equivariant
under the action of the location-scale group on $\mathcal{X}$, find the one that
uniformly maximizes $E_\theta[Q_\theta(C)]$ subject to $E_\theta[P_\theta(C)] \le \beta$ for all
$\theta$. (Hint: You may wish to use the form of the $t$ density given on
page 672.)

24. In Example 6.80 on page 380, prove that po(x) equals the posterior prob- 
ability that 0 is not in the interval B XiPg ( x ). 

25. Prove that Theorem 6.78 applies to the situation in Example 5.57 on
page 319. For the case $\alpha = 0.05$ and $n = 10$, find the conditional
confidence coefficients for the two intervals $(-\infty, T^*]$ and $[T^*, \infty)$ given the
ancillary if the sufficient statistic is $(T_1, T_2) = (1, 1.3)$.

Section 6.3.3: 

26. *Return to Problem 56 on page 293. Find a group of rotations and a loss
function for estimating $\Theta_2$ that make the decision problem invariant. Show
that the hypothesis and alternative $H : \Theta_1 = 0$ and $A : \Theta_1 > 0$ are
invariant, and find the form of the UMPUAI level $\alpha$ test as closely as you
can. (I do not think you can find the cutoffs in closed form.)

27. *Suppose that $X$ is distributed like $N_k(\mu, \Sigma)$ given $\Theta = (\mu, \Sigma)$. Let the group
be Group 4 on page 354. Only one vector observation will be available.

(a) Show that the family of distributions is invariant and show how a
group element acts on the parameter space.

(b) Suppose that we wish to test the hypothesis $H : \mathrm{M} = 0$ versus $A :
\mathrm{M} \ne 0$. Show that the hypothesis-testing problem is invariant, and
find the maximal invariant in the data space. Why are invariant tests
useless in this case?

(c) Suppose that we wish to estimate $\mathrm{M}$. Our action space is $\aleph = \mathbb{R}^k$,
and our loss function is $L(\theta, a) = (\mu - a)^\top \Sigma^{-1} (\mu - a)$. Find a group
$\tilde{G}$ operating on $\aleph$ so that the loss is invariant.






(d) For the estimation problem, show that all equivariant rules are of
the form $\delta(x) = cx$ for some scalar $c$. (Hint: First, prove that for
$i = 1, \ldots, k$, if $x$ has 0 in coordinate $i$, then $\delta(x)$ has zero in coordinate
$i$ also. Finally, write $\delta(x) = \alpha(x)x + \beta(x)y(x)$, where $y(x)$ is orthogonal
to $x$ for all $x$ and the representation is unique unless $\beta(x) = 0$. Then
let $A$ be an orthogonal matrix with first row proportional to $x^\top$ and
second row proportional to $y(x)^\top$.)

28. Suppose that $X_1, \ldots, X_n$ are conditionally IID with $N(\mu, \sigma^2)$ distribution
given $\Theta = (\mu, \sigma)$. Let $G$ be the one-dimensional location group $g_c x = x + c\mathbf{1}$.

(a) Show that Assumption 6.82 holds. 

(b) For what kinds of hypotheses can we find UMPUAI tests? 

(c) Will these tests be UMPU? 

29. Suppose that $Y_{i,1}, \ldots, Y_{i,n_i}$ are conditionally distributed as $N_p(\mu_i, \sigma)$ given
$\mathrm{M}_i = \mu_i$ and $\Sigma = \sigma$ for $i = 1, \ldots, k$ and all $Y_{i,j}$ are conditionally
independent. (Here $\Sigma$ is a $p \times p$ positive definite matrix.) Suppose that the
hypothesis to test is $H : \mathrm{M}AC = O$, where $\mathrm{M}$ is the $p \times k$ matrix whose
$i$th column is $\mathrm{M}_i$, $A$ is a $k \times k$ diagonal matrix with $\sqrt{n_i}$ in the $i$th diagonal
element, $C$ is a $k \times r$ matrix that equals the first $r$ columns of an orthogonal
matrix, and $O$ is a $p \times r$ matrix of all zeros. (Compare to the one-way
analysis of variance on page 384.) Transform the data in order to put this
problem into the form of the multivariate analysis of variance, and find the
matrices $W$ and $B$ in the discussion that begins on page 386.



Chapter 7 

Large Sample Theory 



7.1 Convergence Concepts 

In calculus courses, the concept of convergence of sequences is introduced. 
In this section, we will generalize that concept to include different types of 
stochastic convergence. 

7.1.1 Deterministic Convergence 

We begin by defining types of deterministic convergence. 

Definition 7.1. Let $\{x_n\}_{n=1}^\infty$ be a sequence in a normed linear space,$^1$ and
let $\{r_n\}_{n=1}^\infty$ be a sequence of real numbers. We say that $x_n$ is small order
of $r_n$ (as $n \to \infty$), denoted $x_n = o(r_n)$, if for each $c > 0$ there exists $N$
such that $\|x_n\| \le c|r_n|$ for each $n \ge N$. We say that $x_n$ is large order of
$r_n$ (as $n \to \infty$), denoted $x_n = O(r_n)$, if there exists $c > 0$ and $N$ such
that $\|x_n\| \le c|r_n|$ for each $n \ge N$. If $\{y_n\}_{n=1}^\infty$ is a sequence of vectors and
$x_n - y_n = o(r_n)$ (or $O(r_n)$), then we write $x_n = y_n + o(r_n)$ (or $y_n + O(r_n)$).

What large order and small order allow us to do is to discuss limits of 
ratios without being explicit about the ratios as long as they stay bounded 
or go to zero. Large order means that the ratio of the quantities remains 
bounded. Small order means that the ratio goes to 0. 

Example 7.2. Since $\lim_{n \to \infty} \log(n)/n = 0$, we have $\log(n) = o(n)$. Also, $n^r =
o(n^p)$ if $p > r$. It is easy to prove that $\binom{n}{k} = O(n^k)$ for fixed $k$.
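
The three claims of Example 7.2 can be illustrated by tabulating the relevant ratios. This is a minimal sketch; the grid of $n$ values and the choices $r = 1$, $p = 1.5$, and $k = 3$ are arbitrary assumptions for the illustration.

```python
import math

ns = [10, 100, 1000, 10_000, 100_000]
k = 3

log_ratio = [math.log(n) / n for n in ns]              # log(n)/n, shrinks toward 0
power_ratio = [n**1.0 / n**1.5 for n in ns]            # n^r / n^p with r = 1 < p = 1.5
binom_ratio = [math.comb(n, k) / n**k for n in ns]     # C(n,k)/n^k, stays bounded

for n, a, b, c in zip(ns, log_ratio, power_ratio, binom_ratio):
    print(f"n={n:>6}  log(n)/n={a:.2e}  n/n^1.5={b:.2e}  C(n,3)/n^3={c:.5f}")
```

The first two ratios tend to 0 (small order), while the third stays below $1/k!$ for every $n$ (large order).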



1 The norm of $x$ is denoted by $\|x\|$. Note that a normed linear space is a metric
space with metric $d(x, y) = \|x - y\|$.






Here are some simple consequences of the definitions: 

• If $x_n = o(r_n)$, then $x_n = O(r_n)$.

• If $c$ is real and nonzero, then $x_n = O(r_n)$ if and only if $x_n = O(cr_n)$.
Similarly, $x_n = o(r_n)$ if and only if $x_n = o(cr_n)$.

• Suppose that $y_n = o(r_n)$ and $x_n = O(y_n)$. Then $x_n = o(r_n)$. If
$x_n = o(y_n)$, then $x_n = o(r_n)$.

• If $x_n = o(r_n)$ and $y_n = o(s_n)$, then $x_n + y_n = o(|r_n| + |s_n|)$. Similarly,
if $x_n = O(r_n)$ and $y_n = O(s_n)$, then $x_n + y_n = O(|r_n| + |s_n|)$.

• If $x_n = o(r_n)$ and $y_n = O(s_n)$, then $x_n + y_n = O(|r_n| + |s_n|)$.

• If $x_n = o(r_n)$ and $y_n = o(s_n)$, then $x_n y_n = o(r_n s_n)$. Similarly, if
$x_n = O(r_n)$ and $y_n = O(s_n)$, then $x_n y_n = O(r_n s_n)$.

• If $x_n = o(r_n)$ and $y_n = O(s_n)$, then $x_n y_n = o(r_n s_n)$.

There will be several situations in which we need to use the concepts of
small order and large order. Let $\{r_n\}_{n=1}^\infty$ be a sequence of real numbers.

1. If $\limsup \|x_n/r_n\| < \infty$, then $x_n = O(r_n)$.

2. If $\limsup \|x_n/r_n\| = 0$, then $x_n = o(r_n)$.

3. $x_n = o(1)$ if and only if $\lim_{n \to \infty} x_n = 0$.

4. If $r_n = o(1)$ and $m$ is fixed, then $(1 + r_n)^m = 1 + o(1)$.

5. If $x_{n,k} = o(r_n)$ as $n \to \infty$ for each $k = 1, \ldots, m$, then $\sum_{k=1}^{m} x_{n,k} =
o(r_n)$ if $m$ is fixed.

This last example requires that $m$ be fixed as $n \to \infty$. To see that it is false
otherwise, consider $x_{n,k} = 2^k/n = o(1)$ as $n \to \infty$. But, $\sum_{k=1}^{n} x_{n,k} \to \infty$
as $n \to \infty$.
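
The counterexample can be made concrete with exact arithmetic. In this sketch the helper names `term` and `growing_sum` and the particular values of $n$ are assumptions chosen for illustration; the sum over $k = 1, \ldots, n$ has the closed form $(2^{n+1} - 2)/n$.

```python
from fractions import Fraction

def term(n, k):
    """x_{n,k} = 2^k / n, which tends to 0 as n grows for each fixed k."""
    return Fraction(2**k, n)

def growing_sum(n):
    """Sum of x_{n,k} over k = 1..n, where the number of terms grows with n."""
    return sum(term(n, k) for k in range(1, n + 1))

fixed_m = 5
# With m fixed the sum is 62/n and tends to 0.
fixed_sums = [float(sum(term(n, k) for k in range(1, fixed_m + 1))) for n in (10, 100, 1000)]
# With m = n the sum is (2^(n+1) - 2)/n and diverges.
growing_sums = [float(growing_sum(n)) for n in (5, 10, 20, 40)]
```

With `fixed_m = 5` the sums are 6.2, 0.62, and 0.062, whereas the sums with a growing number of terms increase without bound.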

7.1.2 Stochastic Convergence 

Next, we define stochastic versions of small order and large order. The
setup requires a sequence of probability spaces $\{(\mathcal{X}_n, \mathcal{B}_n, P_n)\}_{n=1}^\infty$. Here,
we assume that each space $\mathcal{X}_n$ is a normed linear space with norm $\|\cdot\|_n$
and that there are functions $X_n : S \to \mathcal{X}_n$ where $(S, \mathcal{A}, \mu)$ is an underlying
probability space. (As before, $\mu(A)$ for $A \in \mathcal{A}$ will often be denoted $\Pr(A)$
and conditional probabilities derived from $\mu$ denoted $\Pr(\cdot|\cdot)$.) In this case,
$P_n$ is the probability induced on $(\mathcal{X}_n, \mathcal{B}_n)$ by $X_n$ from $\mu$. A common example
is the one in which $S = \mathbb{R}^\infty$, $\mathcal{X}_n = \mathbb{R}^n$, and $X_n$ is the first $n$ coordinates.
All of the results in this section and Section 7.2 apply equally well to cases
in which the probabilities $P_n$ are already conditional probabilities given
some parameter $\Theta$. Of course, in such cases, $P_n$ would actually be $P_{\theta,n}$
and $\Pr$ would be denoted $P'_\theta$. Problems 5 and 6 (see page 468) show how
to convert certain limit theorems that are conditional on $\Theta$ into marginal
limit theorems.

Definition 7.3. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random quantities as above
and let $\{r_n\}_{n=1}^\infty$ be a sequence of numbers. We say that $X_n$ is stochastically
small order of $r_n$ (as $n \to \infty$), denoted $X_n = o_P(r_n)$, if, for each $c > 0$
and each $\epsilon > 0$, there exists $N$ such that $\Pr(\|X_n\|_n \le c|r_n|) \ge 1 - \epsilon$ for
all $n \ge N$. We say $X_n$ is stochastically large order of $r_n$ (as $n \to \infty$),
denoted $X_n = O_P(r_n)$, if, for each $\epsilon > 0$, there exists $c > 0$ and $N$ such
that $\Pr(\|X_n\|_n \le c|r_n|) \ge 1 - \epsilon$ for all $n \ge N$. If $\{Y_n\}_{n=1}^\infty$ is a sequence
of random vectors and $X_n - Y_n = o_P(r_n)$ (or $O_P(r_n)$), then we write
$X_n = Y_n + o_P(r_n)$ (or $Y_n + O_P(r_n)$).

Proposition 7.4. $X_n = o_P(r_n)$ if and only if, for each $c > 0$,

$$ \lim_{n \to \infty} \Pr(\|X_n\|_n \le c|r_n|) = 1. $$

Note that in the definition of $O_P$, the $c$ is allowed to vary with $\epsilon$, so there
is no obvious analog to Proposition 7.4 for $O_P$. We will usually leave the
subscript $n$ off of the norm $\|\cdot\|_n$, since there is seldom any chance of
confusing one norm with another.

Example 7.5. Let $\{Z_n\}_{n=1}^\infty$ be IID random variables with mean $\mu$ and variance
$\sigma^2$. Let $X_n = \sqrt{n}(\bar{Z}_n - \mu)/\sigma$. So, $\mathcal{X}_n = \mathbb{R}$ for every $n$, and $P_n$, which is the
distribution of $X_n$, is a probability measure on the Borel subsets of the real line.
The central limit theorem B.97 (together with Problem 25 on page 664) says that
$\lim_{n \to \infty} P_n((-\infty, t]) = \Phi(t)$ for all $t$, where $\Phi$ is the standard normal CDF. For
each $\epsilon > 0$, there exists $t$ such that $\Phi(t) - \Phi(-t) > 1 - \epsilon/2$. Choose $N$ such that
for each $n \ge N$,

$$ P_n((-\infty, -t]) < \Phi(-t) + \frac{\epsilon}{4}, \qquad P_n((-\infty, t]) > \Phi(t) - \frac{\epsilon}{4}. $$

It follows that $\Pr(|X_n| \le t)$ is at least

$$ P_n((-\infty, t]) - P_n((-\infty, -t]) > \Phi(t) - \Phi(-t) - \frac{\epsilon}{2} > 1 - \epsilon. $$

Hence, $X_n = O_P(1)$.$^2$ Also, $\bar{Z}_n - \mu = O_P(1/\sqrt{n})$. If $0 \le a < 1/2$, then $\bar{Z}_n - \mu =
o_P(n^{-a})$. In particular ($a = 0$), $\bar{Z}_n - \mu = o_P(1)$.
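
A small simulation gives a feel for the two statements in Example 7.5. This is a sketch under assumptions not in the text: the $Z_i$ are taken to be Exp(1) (so $\mu = \sigma = 1$), the bound $t = 3$, the exponent $a = 1/4$, the threshold $c = 0.5$, the sample sizes, and the seed are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0                      # mean and standard deviation of Exp(1)
reps, t, a, c = 2000, 3.0, 0.25, 0.5

def standardized_means(n):
    """Draw reps copies of X_n = sqrt(n) * (Zbar_n - mu) / sigma from Exp(1) samples."""
    z = rng.exponential(scale=1.0, size=(reps, n))
    return np.sqrt(n) * (z.mean(axis=1) - mu) / sigma

inside, exceed = {}, {}
for n in (25, 100, 400):
    x = standardized_means(n)
    inside[n] = np.mean(np.abs(x) <= t)               # estimates Pr(|X_n| <= t)
    dev = sigma * np.abs(x) / np.sqrt(n)              # |Zbar_n - mu|
    exceed[n] = np.mean(dev > c * n ** (-a))          # estimates Pr(|Zbar_n - mu| > c n^(-a))
```

The estimates in `inside` stay near 1 for every $n$ (the $O_P(1)$ statement), while those in `exceed` shrink as $n$ grows (the $o_P(n^{-a})$ statement).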

Stochastic convergence is closely related to the concept of convergence 
in probability. We restate Definition B.89 in the present context. 

Definition 7.6. If $\{X_n\}_{n=1}^\infty$ and $X$ are random quantities in a normed
linear space, and if, for every $\epsilon > 0$, $\lim_{n \to \infty} \Pr(\|X_n - X\| > \epsilon) = 0$, then
we say that $X_n$ converges in probability to $X$, which is written $X_n \xrightarrow{P} X$.



2 This phenomenon is quite general. See Problem 3 on page 467. 






Proposition 7.7. Suppose that $Y_n = f_n(X_n)$ for each $n$ where $f_n : \mathcal{X}_n \to
R$, and $R$ is a normed linear space with Borel $\sigma$-field. Assume that each $f_n$ is
measurable. Let $Y : S \to R$ be another random quantity. Then $\|Y_n - Y\| =
o_P(1)$, if and only if $Y_n$ converges in probability to $Y$.

Example 7.8. Suppose that $\lim_{n \to \infty} \mathrm{E}(Y_n - c)^2 = 0$. Then Tchebychev's
inequality can be used to prove that $Y_n \xrightarrow{P} c$.
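
For real-valued $Y_n$, the step in Example 7.8 can be written out as follows (Tchebychev's inequality in the form obtained by applying Markov's inequality to $(Y_n - c)^2$): for every $\epsilon > 0$,

```latex
\Pr(|Y_n - c| > \epsilon)
  \le \frac{\mathrm{E}(Y_n - c)^2}{\epsilon^2}
  \longrightarrow 0 \quad \text{as } n \to \infty .
```

Since $\epsilon$ is arbitrary, this is exactly the condition in the definition of convergence in probability.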

Definition 7.9. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family of distributions
on a sequence space $\mathcal{X}^\infty$, and let $g : \Omega \to G$ be a measurable function to a
metric space $G$ with Borel $\sigma$-field. Let $\mathcal{X}_n = \mathcal{X}^n$, and let $Y_n : \mathcal{X}_n \to G$ be
measurable. We say that $Y_n$ is consistent for $g(\Theta)$ if $Y_n \xrightarrow{P} g(\theta)$ conditional
on $\Theta = \theta$ for all $\theta \in \Omega$.

Example 7.10. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID $N(\mu, \sigma^2)$ given $\Theta = (\sigma, \mu)$.
Let $Y_n = \sum_{i=1}^{n} X_i/n$ and $g(\theta) = \mu$. Then $Y_n$ is consistent for $g(\Theta)$ according to
the weak law of large numbers B.95.

The following is a more general definition of "in probability."

Definition 7.11. Suppose that $\{(\mathcal{X}_n, \mathcal{B}_n, P_n)\}_{n=1}^\infty$ is a sequence of
probability spaces. Define $\mathcal{Y} = \prod_{n=1}^\infty \mathcal{X}_n$. Let $T \subseteq \mathcal{Y}$. We say that $T$ occurs in
probability, denoted $\mathcal{P}(T)$, if, for each $\epsilon > 0$, there exists $T_n(\epsilon) \in \mathcal{B}_n$ for
$n = 1, 2, \ldots$ such that $P_n(T_n(\epsilon)) \ge 1 - \epsilon$ for each $n$ and $\prod_{n=1}^\infty T_n(\epsilon) \subseteq T$.

The following lemma essentially says that a sequence of random quantities
$\{Y_n\}_{n=1}^\infty$ is $O_P(r_n)$ or $o_P(r_n)$ if and only if the set of possible values
for $(Y_1, Y_2, \ldots)$ which are $O(r_n)$ or $o(r_n)$ occurs in probability.

Lemma 7.12. Use the notation from Definition 7.11. Let $Y_n = f_n(X_n)$
and let

$$ T = \{(x_1, x_2, \ldots) \in \mathcal{Y} : f_n(x_n) = o(r_n)\}. $$

Then $\mathcal{P}(T)$ if and only if $Y_n = o_P(r_n)$. Similarly, if

$$ T = \{(x_1, x_2, \ldots) \in \mathcal{Y} : f_n(x_n) = O(r_n)\}, $$

then $\mathcal{P}(T)$ if and only if $Y_n = O_P(r_n)$.

PROOF. We will do only the $o_P$ part since the $O_P$ part is similar. First, for
the "if" part, assume $Y_n = o_P(r_n)$ and let $\epsilon > 0$. Let $c_1 > c_2 > \cdots$ decrease
to 0. For each $i \ge 1$, let $N(\epsilon, c_i) > N(\epsilon, c_{i-1})$ be such that for $n \ge N(\epsilon, c_i)$,
$\Pr(\|Y_n\| \le c_i|r_n|) \ge 1 - \epsilon$. Define $T_n(\epsilon) = \mathcal{X}_n$ for $n = 1, \ldots, N(\epsilon, c_1)$. For
$N(\epsilon, c_{i-1}) < n \le N(\epsilon, c_i)$, define

$$ T_n(\epsilon) = \{x_n : \|f_n(x_n)\| \le |r_n| c_{i-1}\}. $$

By construction, we have $P_n(T_n(\epsilon)) \ge 1 - \epsilon$ for every $n$. If $(x_1, x_2, \ldots) \in
\prod_{n=1}^\infty T_n(\epsilon)$, then $\lim_{n \to \infty} \|f_n(x_n)\|/|r_n| = 0$ by construction. It follows
that $(x_1, x_2, \ldots) \in T$ and we have proven $\mathcal{P}(T)$.

For the "only if" part, assume that $\mathcal{P}(T)$ and let $T_n(\epsilon)$ be as in
Definition 7.11. Since $f_n(x_n) = o(r_n)$ for $(x_1, x_2, \ldots) \in T$, it follows that

$$ z_n = \sup_{x_n \in T_n(\epsilon)} \frac{\|f_n(x_n)\|}{|r_n|} < \infty, $$

for all but finitely many $n$. Hence,

$$ P_n\left(\left\{x_n : \frac{\|f_n(x_n)\|}{|r_n|} \le z_n\right\}\right) \ge P_n(T_n(\epsilon)) \ge 1 - \epsilon. $$

Now, choose $x_n^* \in T_n(\epsilon)$ such that

$$ z_n \le \frac{\|f_n(x_n^*)\|}{|r_n|} + \frac{1}{n} \to 0, \text{ as } n \to \infty, $$

so $\lim_{n \to \infty} z_n = 0$. For each $\epsilon > 0$ and $c > 0$, choose $N$ such that if $n \ge N$,
then $z_n \le c$. It follows that, if $n \ge N$, then $\Pr(\|Y_n\|/|r_n| \le c) \ge 1 - \epsilon$.
Hence, $Y_n = o_P(r_n)$. □

Example 7.13. Let $f_n : \mathcal{X}_n \to \mathbb{R}$ and $Y_n = f_n(X_n)$. Define

$$ T = \{(x_1, x_2, \ldots) : \lim_{n \to \infty} f_n(x_n) = 0\}. $$

Then $Y_n = o_P(1)$ if and only if $\mathcal{P}(T)$, according to Lemma 7.12.

If countably many things occur in probability, then they simultaneously
occur in probability.

Proposition 7.14.$^3$ If $\mathcal{P}(S_i)$ for $i = 1, 2, \ldots$, then $\mathcal{P}(\bigcap_{i=1}^\infty S_i)$. If $T \subseteq S$,
then $\mathcal{P}(T)$ implies $\mathcal{P}(S)$.

We are now in position to prove a theorem that says (in a more precise
manner) that if you can prove a result involving $o$ and $O$, then you can
replace $o$ by $o_P$ and $O$ by $O_P$ and prove a corresponding result.

Theorem 7.15.$^4$ Let $\mathcal{Y}_0, \mathcal{Y}_{1,1}, \mathcal{Y}_{1,2}, \ldots, \mathcal{Y}_{2,1}, \mathcal{Y}_{2,2}, \ldots$ be metric spaces. Let
$h_n : \mathcal{X}_n \to \mathcal{Y}_0$, $f_n^{(j)} : \mathcal{X}_n \to \mathcal{Y}_{1,j}$, $j = 1, 2, \ldots$, and $g_n^{(k)} : \mathcal{X}_n \to \mathcal{Y}_{2,k}$ for
$k = 1, 2, \ldots$. Suppose that $f_n^{(j)}(X_n) = O_P(r_n^{(j)})$ and $g_n^{(k)}(X_n) = o_P(s_n^{(k)})$
for all $j$ and $k$. Also, suppose that it is known that

$$ \left(f_n^{(j)}(x_n) = O(r_n^{(j)}) \text{ and } g_n^{(k)}(x_n) = o(s_n^{(k)}) \text{ for all } j \text{ and } k\right) \text{ implies } h_n(x_n) = O(t_n) \ (\text{or } h_n(x_n) = o(t_n)); $$

then $h_n(X_n) = O_P(t_n)$ (or $h_n(X_n) = o_P(t_n)$).



3 This proposition is used in the proof of Theorem 7.15. 
4 This theorem is used to help develop the delta method. 






Proof. We will only prove the $O_P$ part. The $o_P$ part is virtually identical.
Let $S^{(2j-1)} = \{x : f_n^{(j)}(x_n) = O(r_n^{(j)})\}$ for all $j$ and $S^{(2k)} = \{x : g_n^{(k)}(x_n) =
o(s_n^{(k)})\}$ for all $k$. (If there are only finitely many $f_n^{(j)}$ or $g_n^{(k)}$, then just let
$S^{(i)} = \prod_{n=1}^\infty \mathcal{X}_n$ after you run out of functions.) Define $T = \{x : h_n(x_n) =
O(t_n)\}$. The stated conditions imply that $\bigcap_{i=1}^\infty S^{(i)} \subseteq T$. Also, we have
assumed that $\mathcal{P}(S^{(i)})$ for all $i$, so $\mathcal{P}(T)$ by Proposition 7.14. □

Example 7.16. Suppose that $w : \mathbb{R} \to \mathbb{R}$ has $k + 1$ continuous derivatives at $c$.
Define

$$ T_k(x, c) = w(c) + (x - c)w'(c) + \cdots + \frac{1}{k!}(x - c)^k w^{(k)}(c), $$

where $w^{(k)}$ denotes the $k$th derivative of $w$. Taylor's theorem C.1 says (among
other things) that

$$ \lim_{x \to c} \frac{w(x) - T_k(x, c)}{(x - c)^k} = 0. $$

Suppose that $x_n - c = O(r_n)$, where $r_n = o(1)$. Then $x_n - c = o(1)$, and we
conclude that $w(x_n) - T_k(x_n, c) = o((x_n - c)^k)$, hence $w(x_n) = T_k(x_n, c) + o(r_n^k)$.
Similarly, we can write $w(x_n) = T_{k-1}(x_n, c) + O(r_n^k)$. Now, suppose that $X_n -
c = O_P(r_n)$. In the notation of Theorem 7.15, let $\mathcal{X}_n = \mathbb{R}$ for all $n$ and let
$\mathcal{Y}_0 = \mathcal{Y}_{1,1} = \mathbb{R}$. For each $n$, let $f_n^{(1)}(x) = x - c$ and $h_n(\cdot) = w(\cdot) - T_k(\cdot, c)$ or
$w(\cdot) - T_{k-1}(\cdot, c)$. Suppose that there are no $g$ functions. Then Theorem 7.15 says
that

$$ w(X_n) = T_k(X_n, c) + o_P(r_n^k) = T_{k-1}(X_n, c) + O_P(r_n^k). $$

Furthermore, if $w$ has $k + 1$ continuous derivatives everywhere, then if $X_n - X^* =
O_P(r_n)$, then $w(X_n) = T_k(X_n, X^*) + o_P(r_n^k) = T_{k-1}(X_n, X^*) + O_P(r_n^k)$.
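
The deterministic half of Example 7.16 is easy to check numerically. The sketch below assumes $w = \exp$, $c = 0.5$, $k = 2$, and $r_n = 1/n$ with $x_n = c + r_n$; these choices and the helper name `taylor_exp` are illustrative only.

```python
import math

def taylor_exp(x, c, k):
    """Taylor polynomial T_k(x, c) of w = exp about c (every derivative of exp at c is exp(c))."""
    return sum(math.exp(c) * (x - c) ** j / math.factorial(j) for j in range(k + 1))

c, k = 0.5, 2
rows = []
for n in (10, 100, 1000):
    r_n = 1.0 / n                       # r_n = o(1)
    x_n = c + r_n                       # so x_n - c = O(r_n)
    small = (math.exp(x_n) - taylor_exp(x_n, c, k)) / r_n**k       # o(r_n^k): ratio tends to 0
    large = (math.exp(x_n) - taylor_exp(x_n, c, k - 1)) / r_n**k   # O(r_n^k): ratio stays bounded
    rows.append((n, small, large))
```

The `small` column decreases toward 0, while the `large` column settles near $w^{(k)}(c)/k! = e^{0.5}/2$.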

Corollary 7.17.$^5$ Let $\mathcal{Y}$ and $\mathcal{Z}$ be metric spaces. If $Y_n = f_n(X_n) \in \mathcal{Y}$ and
$Y_n \xrightarrow{P} c \in \mathcal{Y}$ and $g : \mathcal{Y} \to \mathcal{Z}$ is continuous at $c$, then $g(Y_n) \xrightarrow{P} g(c)$.

Another type of stochastic convergence is convergence in distribution.
We restate Definition B.80 here.

Definition 7.18. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random quantities and let
$X$ be another random quantity, all taking values in the same topological
space $\mathcal{X}$. Suppose that

$$ \lim_{n \to \infty} \mathrm{E}(f(X_n)) = \mathrm{E}(f(X)) $$

for every bounded continuous function $f : \mathcal{X} \to \mathbb{R}$; then we say that
$X_n$ converges in distribution to $X$, which is written $X_n \xrightarrow{\mathcal{D}} X$ or $\mathcal{L}(X_n) \to
\mathcal{L}(X)$. If $X_n \xrightarrow{\mathcal{D}} X$, we call the distribution of $X$ the asymptotic distribution
of $X_n$. If $X_n \xrightarrow{\mathcal{D}} X$, and if $R_n$ and $R$ are the distributions of $X_n$ and $X$,
respectively, then we say that $R_n$ converges weakly to $R$, denoted $R_n \xrightarrow{W} R$.

5 This corollary is used to help prove that posterior distributions are
asymptotically normal.






The portmanteau theorem B.83 gives several criteria that are equivalent 
to convergence in distribution. These can be used to derive a connection 
between convergence in distribution and op. 

Lemma 7.19.$^6$ Suppose that $\mathcal{X}$ is a metric space with metric $d$. If $X_n \xrightarrow{\mathcal{D}} X$
and $d(X_n, Y_n) = o_P(1)$, then $Y_n \xrightarrow{\mathcal{D}} X$.

Proof. Let $R_n$ be the distribution of $Y_n$, and let $P$ be the distribution of $X$.
We must show that $R_n \xrightarrow{W} P$. (See Definition 7.18.) Let $B$ be an arbitrary
closed set. According to the portmanteau theorem B.83, it suffices to show
that $\limsup_{n \to \infty} R_n(B) \le P(B)$. Define, for $C \in \mathcal{B}$,

$$ d(x, C) = \inf_{y \in C} d(x, y). $$

Then

$$ \{Y_n \in B\} \subseteq \{d(X_n, B) \le \epsilon\} \cup \{d(X_n, Y_n) > \epsilon\}. $$

Define $C_\epsilon = \{x : d(x, B) \le \epsilon\}$, which is a closed set. So,

$$ R_n(B) = \Pr(Y_n \in B) \le \Pr(d(X_n, B) \le \epsilon) + \Pr(d(X_n, Y_n) > \epsilon) = P_n(C_\epsilon) + \Pr(d(X_n, Y_n) > \epsilon). $$

We have assumed that $\lim_{n \to \infty} \Pr(d(X_n, Y_n) > \epsilon) = 0$ and that $X_n \xrightarrow{\mathcal{D}} X$,
so we conclude $\limsup_{n \to \infty} R_n(B) \le \limsup_{n \to \infty} P_n(C_\epsilon) \le P(C_\epsilon)$. Since
$B$ is closed, $\lim_{\epsilon \to 0} P(C_\epsilon) = P(B)$. It follows then that

$$ \limsup_{n \to \infty} R_n(B) \le P(B), $$

hence $Y_n \xrightarrow{\mathcal{D}} X$. □

Lemma 7.19 says that if $X_n \xrightarrow{\mathcal{D}} X$, then so too does anything close to $X_n$,
that is, anything that differs from $X_n$ by $o_P(1)$.

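Theorem 7.20. If the $\sigma$-field on $\mathcal{X} \times \mathcal{Y}$ is the product $\sigma$-field, if $X_n \xrightarrow{\mathcal{D}} X$
and $Y_n \xrightarrow{\mathcal{D}} Y$, and if $X_n$ is independent of $Y_n$ for all $n$, then $(X_n, Y_n) \xrightarrow{\mathcal{D}}
(X, Y)$, where $X$ and $Y$ are independent.

Proof. Since $X_n$ and $Y_n$ are independent for each $n$, their joint
characteristic function is

$$ \phi(t, s) = \mathrm{E}\exp\left\{i \begin{pmatrix} t \\ s \end{pmatrix}^{\!\top} \begin{pmatrix} X_n \\ Y_n \end{pmatrix}\right\} = \mathrm{E}\exp\{it^\top X_n\}\, \mathrm{E}\exp\{is^\top Y_n\}. $$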

6 This lemma is used in the proofs of Theorems 7.22, 7.25, 7.35, and 7.63 and 
to help develop the delta method. 




This product converges to $\mathrm{E}\exp\{it^\top X\}\, \mathrm{E}\exp\{is^\top Y\}$, which is the
characteristic function of independent $X$ and $Y$. Now apply Theorem B.93. □

Using the fact that a constant is independent of everything, we have the 
following simple corollary to Theorem 7.20. 

Corollary 7.21.$^7$ Suppose that $\{X_n\}_{n=1}^\infty$ take values in a metric space $\mathcal{X}$.
If $X_n \xrightarrow{\mathcal{D}} X$ and $b \in \mathcal{Y}$ is a constant, then $(X_n, b) \xrightarrow{\mathcal{D}} (X, b)$.

The conclusions of the following theorem are taken for granted in many 
calculations of asymptotic distributions. 

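Theorem 7.22. 1. Suppose that $\{X_n\}_{n=1}^\infty$ take values in a topological space
$\mathcal{X}$ and that $\{Y_n\}_{n=1}^\infty$ take values in a topological space $\mathcal{Y}$. If $(X_n, Y_n) \xrightarrow{\mathcal{D}}
(X, Y)$, then $X_n \xrightarrow{\mathcal{D}} X$.

2. Suppose that $\{X_n\}_{n=1}^\infty$ take values in a metric space $\mathcal{X}$ and that $\{Y_n\}_{n=1}^\infty$
take values in a metric space $\mathcal{Y}$. Let $b \in \mathcal{Y}$. If $X_n \xrightarrow{\mathcal{D}} X$ and $Y_n \xrightarrow{P} b$, then
$(X_n, Y_n) \xrightarrow{\mathcal{D}} (X, b)$.

Proof. For part 1, let $g : \mathcal{X} \times \mathcal{Y} \to \mathcal{X}$ be defined by $g(x, y) = x$. Then $g$
is continuous and the continuous mapping theorem B.88 says that $X_n =
g(X_n, Y_n) \xrightarrow{\mathcal{D}} g(X, Y) = X$.

For part 2, let $d_1$ be the metric in $\mathcal{X}$ and let $d_2$ be the metric in $\mathcal{Y}$. Then

$$ d((x_1, y_1), (x_2, y_2)) = d_1(x_1, x_2) + d_2(y_1, y_2) $$

is a metric in $\mathcal{X} \times \mathcal{Y}$ and the product $\sigma$-field is the Borel $\sigma$-field. By
Corollary 7.21, we have that $(X_n, b) \xrightarrow{\mathcal{D}} (X, b)$. We have assumed that
$d((X_n, Y_n), (X_n, b)) = d_2(Y_n, b) = o_P(1)$. So, by Lemma 7.19, $(X_n, Y_n) \xrightarrow{\mathcal{D}}
(X, b)$. □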

7.1.3 The Delta Method 

A method for finding the asymptotic distribution of a function of a random
vector is based on Lemma 7.19 and is called the delta method. [See Rao
(1973), Chapter 6.] As an example, let $Y_n$ be the average of $n$ IID random
variables with mean $\mu$ and variance $\sigma^2$. The central limit theorem B.97 says
that $\sqrt{n}(Y_n - \mu) \xrightarrow{\mathcal{D}} N(0,\sigma^2)$.$^8$ Now, let $g$ be a function with continuous
derivative. We can write

$$g(t) = g(\mu) + (t - \mu)g'(\mu) + o(t - \mu).$$



7 This corollary is used in the proof of Theorem 7.22.
8 It is common to call an estimator $Z_n$, with the property that $\sqrt{n}(Z_n - \theta)$
converges in distribution to a nondegenerate distribution, $\sqrt{n}$-consistent.






If we are interested in $g(Y_n)$, we can write

$$g(Y_n) = g(\mu) + (Y_n - \mu)g'(\mu) + o(Y_n - \mu).$$

Since $\sqrt{n}(Y_n - \mu)$ converges in distribution, $Y_n - \mu = O_P(1/\sqrt{n})$. Hence
$\sqrt{n}\,o(Y_n - \mu) = o_P(1)$ by Theorem 7.15. So,

$$\sqrt{n}(g(Y_n) - g(\mu)) = \sqrt{n}(Y_n - \mu)g'(\mu) + o_P(1).$$

By Lemma 7.19, we get a useful result when $g'(\mu) \neq 0$:

$$\sqrt{n}(g(Y_n) - g(\mu)) \xrightarrow{\mathcal{D}} N(0, \sigma^2[g'(\mu)]^2).$$

The result in the example above suggests a valuable use for the delta
method. If the variance of the asymptotic distribution of $\sqrt{n}(Y_n - \mu)$ is
an undesirable quantity in the application for which it is intended, then
a transformation of $Y_n$ will have a different variance that may be more
suitable. For example, suppose that $nY_n$ has $Bin(n,p)$ distribution given
$P = p$. The asymptotic distribution of $\sqrt{n}(Y_n - p)$ is $N(0, p(1-p))$ given
$P = p$. For comparing several possible values of $P$, it might be nice if the
only dependence of the random variable on $P$ were through the mean. This
can be arranged asymptotically by choosing a function $g$ such that

$$g'(p) = \frac{1}{\sqrt{p(1-p)}}.$$

This is a simple differential equation to solve, and the solution is
$g(t) = 2\arcsin(\sqrt{t})$. The asymptotic distribution of

$$\sqrt{n}\left[2\arcsin\left(\sqrt{Y_n}\right) - 2\arcsin\left(\sqrt{p}\right)\right]$$

is $N(0,1)$ given $P = p$. This is a special case of what is called a variance
stabilizing transformation. The general method for constructing a variance
stabilizing transformation is as follows. Suppose that $\sqrt{n}(Y_n - \mu)$ has
asymptotic distribution $N(0, h(\mu))$. Then, choose a function $g(t)$ such that
$g'(\mu) = 1/\sqrt{h(\mu)}$. That is,

$$g(t) = \int_c^t \frac{1}{\sqrt{h(\mu)}}\,d\mu,$$

where $c$ is any constant such that the integral exists. The asymptotic
distribution of $\sqrt{n}(g(Y_n) - g(\mu))$ will be $N(0,1)$. It is common, when
$\sqrt{n}(Y_n - \mu) \xrightarrow{\mathcal{D}} N(0,\sigma^2)$, to say that the asymptotic distribution of $Y_n$
is $N(\mu, \sigma^2/n)$. In symbols, we may write $Y_n \sim AN(\mu, \sigma^2/n)$. In such cases
we will call $\sigma^2/n$ the asymptotic variance of $Y_n$.
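The effect of the arcsine transformation can be checked without simulation. The following is a minimal sketch, not part of the original text: it computes the exact variance of $\sqrt{n}\,g(K/n)$ for $K \sim Bin(n,p)$ by summing over the binomial probabilities. The function names and the choice $n = 400$ are illustrative assumptions.

```python
import math

def binom_pmf(n, p):
    """Exact Bin(n, p) probabilities, computed on the log scale to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    return [
        math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]

def scaled_variance(n, p, g):
    """Exact variance of sqrt(n) * g(K / n) when K ~ Bin(n, p)."""
    pmf = binom_pmf(n, p)
    values = [g(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, values))
    return n * sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))

def arcsine(t):
    """The variance stabilizing transformation g(t) = 2 arcsin(sqrt(t))."""
    return 2.0 * math.asin(math.sqrt(t))

for p in (0.2, 0.5, 0.8):
    raw = scaled_variance(400, p, lambda t: t)   # equals p(1 - p), depends on p
    stabilized = scaled_variance(400, p, arcsine)  # close to 1 for every p
    print(f"p={p}: raw={raw:.4f}, arcsine={stabilized:.4f}")
```

The untransformed variance changes with $p$, while the transformed one stays near 1, which is the sense in which the dependence on $P$ has been moved into the mean.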






There is also a multivariate delta method. If $g : \mathbb{R}^k \to \mathbb{R}$ has continuous
first partial derivatives, let $\nabla g(\mu)$ be the gradient (vector of first partial
derivatives) at $\mu$. Then $g(t) = g(\mu) + (t - \mu)^{\top}\nabla g(\mu) + o(t - \mu)$. If
$\sqrt{n}(Y_n - \mu) \xrightarrow{\mathcal{D}} N_k(0, \sigma)$, then

$$\sqrt{n}[g(Y_n) - g(\mu)] \xrightarrow{\mathcal{D}} N(0, \nabla g(\mu)^{\top}\sigma\nabla g(\mu)).$$

Here are some multivariate applications of the delta method.

Example 7.23. Importance sampling (see Section B.7) is a means of approximating
the ratio of integrals of the form $\int v(\theta)h(\theta)\,d\theta / \int h(\theta)\,d\theta$. Let $\{X_n\}_{n=1}^{\infty}$
be an IID sequence of pseudorandom numbers with density $f$, and let
$W_i = h(X_i)/f(X_i)$ and $Z_i = v(X_i)W_i$. If these have finite variance, then the sample
averages $(\overline{W}_n, \overline{Z}_n)$ will, by the multivariate central limit theorem B.99, be
approximately bivariate normal with mean $\mu = (\int h(\theta)\,d\theta, \int v(\theta)h(\theta)\,d\theta)^{\top}$ and
covariance matrix equal to $1/n$ times the covariance matrix $\sigma = ((\sigma_{i,j}))$ of the
$(W_i, Z_i)$ pairs. Now, apply the delta method to find the asymptotic distribution
of the ratio of the sample averages. The asymptotic mean is $\mu_2/\mu_1$, the ratio we
want to approximate, and the asymptotic variance is $1/n$ times

$$\sigma_{1,1}\frac{\mu_2^2}{\mu_1^4} + \sigma_{2,2}\frac{1}{\mu_1^2} - 2\sigma_{1,2}\frac{\mu_2}{\mu_1^3}.$$

In practice, it is common to approximate $\sigma$ by the sample covariance matrix of
the $(W_i, Z_i)$ pairs.
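The recipe in Example 7.23 can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's code: the weights are taken to be $W_i = h(X_i)/f(X_i)$, the name `importance_ratio` is hypothetical, and the test problem ($h$ an unnormalized standard normal density, $v(\theta) = \theta^2$, proposal $N(0, 2^2)$) is chosen only because the true ratio is known to be 1.

```python
import math
import random

def importance_ratio(v, h, draw, density, n, seed=0):
    """Ratio-of-averages estimate with a delta-method standard error."""
    rng = random.Random(seed)
    w, z = [], []
    for _ in range(n):
        x = draw(rng)
        weight = h(x) / density(x)      # W_i
        w.append(weight)
        z.append(v(x) * weight)         # Z_i = v(X_i) W_i
    w_bar, z_bar = sum(w) / n, sum(z) / n
    # Sample covariance matrix of the (W_i, Z_i) pairs.
    s11 = sum((a - w_bar) ** 2 for a in w) / (n - 1)
    s22 = sum((b - z_bar) ** 2 for b in z) / (n - 1)
    s12 = sum((a - w_bar) * (b - z_bar) for a, b in zip(w, z)) / (n - 1)
    # Gradient of z/w is (-z/w^2, 1/w); plug sample averages in for the means.
    variance = (s11 * z_bar ** 2 / w_bar ** 4 + s22 / w_bar ** 2
                - 2 * s12 * z_bar / w_bar ** 3) / n
    return z_bar / w_bar, math.sqrt(max(variance, 0.0))

def h(theta):
    return math.exp(-0.5 * theta * theta)

def v(theta):
    return theta * theta

def draw(rng):
    return rng.gauss(0.0, 2.0)

def density(x):
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

estimate, std_error = importance_ratio(v, h, draw, density, n=20000, seed=1)
print(estimate, std_error)
```

The proposal has heavier tails than $h$, so the weights are bounded and the finite-variance condition of the example holds.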

The following example uses the reasoning behind the delta method with- 
out using the delta method itself. 

Example 7.24. Suppose that we wish to find the asymptotic distribution of the
roots of polynomials with random coefficients. Let $Y_n \sim AN_{k+1}(\mu, \Sigma/n)$, where
$Y_n = (Y_{n,0}, \ldots, Y_{n,k})^{\top}$. Define the polynomial

$$p_n(u) = \sum_{j=0}^{k} Y_{n,j}u^j.$$

Let $U_n^*$ be the smallest root of $p_n(u)$. Define $p(u) = \sum_{j=0}^{k} \mu_j u^j$, and suppose
that its smallest root is $u_0$ and this root has multiplicity one. That is, $p(u_0) = 0$
but $p'(u_0) \neq 0$. It is not difficult to show that the smallest root with odd
multiplicity of a polynomial is a continuous function of the coefficients.$^9$ It follows

9 To see this, note that a polynomial changes sign as the variable passes
a root of odd multiplicity. There will be points arbitrarily close to the root at
which the polynomial has opposite signs. If the coefficients don't change much,
the signs will remain the same at these points, hence a root will be between them.
If the root had even multiplicity, the polynomial would have constant sign in a
neighborhood of the root, and small changes in the coefficients could remove all
roots from the neighborhood.




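from Theorem 7.15 that $U_n^* \xrightarrow{P} u_0$. To find the asymptotic distribution of $U_n^*$,
write $p_n(U_n^*)$ as

$$0 = p_n(U_n^*) = p_n(u_0) + (U_n^* - u_0)p_n'(V_n^*),$$

where $V_n^*$ is between $u_0$ and $U_n^*$. So, $\{U_n^* \to u_0\} \subseteq \{V_n^* \to u_0\}$, and $V_n^* \xrightarrow{P} u_0$ also.
Furthermore,

$$p_n'(V_n^*) = \sum_{j=1}^{k} jY_{n,j}(V_n^*)^{j-1} \xrightarrow{P} \sum_{j=1}^{k} j\mu_j u_0^{j-1} = p'(u_0) \neq 0.$$

So, $U_n^* - u_0 = [p_n(U_n^*) - p_n(u_0)]/W_n^*$, where $W_n^* = p_n'(V_n^*) \xrightarrow{P} p'(u_0)$. Now, let
$u^{\top} = (1, u_0, \ldots, u_0^k)$ and write

$$\sqrt{n}(U_n^* - u_0) = -\sqrt{n}\frac{p_n(u_0)}{W_n^*} = -\sqrt{n}\frac{\sum_{j=0}^{k} Y_{n,j}u_0^j}{W_n^*} = -\frac{1}{W_n^*}\sqrt{n}\,u^{\top}(Y_n - \mu),$$

where the last equality uses $u^{\top}\mu = p(u_0) = 0$. This converges in distribution
to $N(0, u^{\top}\Sigma u/[p'(u_0)]^2)$ by Theorem 7.22.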



7.2 Sample Quantiles 

The reader interested in a thorough treatment of sample quantiles should 
read the book by David (1970). In this section, we present some of the more 
commonly used asymptotic results on the distribution of sample quantiles. 

7.2.1 A Single Quantile 

Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID random variables with distribution
$P$ given $\mathbf{P} = P$ and suppose that $P$ has a CDF $F$ with derivative
$f$ (at least in a neighborhood of $x_p$ where $F(x_p) = p$) and $\infty > f(x_p) > 0$.
If the observed values of the first $n$ $X_i$, when ordered from smallest to
largest, are $x_{(1)}, \ldots, x_{(n)}$, define the empirical CDF by $F_n(x_{(i)}) = i/n$ and
interpolate linearly in between. (Do something arbitrary, but continuous
and strictly increasing below $x_{(1)}$.) Now, $F_n$ is continuous and strictly
increasing on $(-\infty, x_{(n)}]$.$^{10}$ Define the sample $p$ quantile by $Y_p^{(n)} = F_n^{-1}(p)$,
for $0 < p < 1$.
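The definition of the sample quantile above can be made concrete as follows. This is a hedged sketch, not code from the text: the function name `sample_quantile` is hypothetical, the observations are assumed distinct, and for $p \le 1/n$ the sketch simply returns the smallest observation (one of the arbitrary choices the text permits below $x_{(1)}$, although it is not strictly increasing there).

```python
import math

def sample_quantile(data, p):
    """Return F_n^{-1}(p) for the linearly interpolated empirical CDF.

    F_n(x_(i)) = i/n at the ordered observations, linear in between.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    j = math.ceil(p * n)  # smallest j with p <= j/n, so (j-1)/n < p <= j/n
    if j <= 1:
        # Assumption: clamp to the smallest observation below x_(1).
        return xs[0]
    lower, upper = xs[j - 2], xs[j - 1]  # x_(j-1) and x_(j)
    return lower + (p * n - (j - 1)) * (upper - lower)

print(sample_quantile([8, 1, 4, 2], 0.625))
```

With ordered values 1, 2, 4, 8, the empirical CDF equals 0.5 at 2 and 0.75 at 4, so the 0.625 quantile falls halfway between them.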

The goal of this section is to prove a theorem specifying the asymptotic 
distribution of a sample quantile. 



10 If $F(c) = 0$ and $F(x) > 0$ for $x > c$, then we only need $F_n$ to be strictly
increasing on $[c, x_{(n)}]$.




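Theorem 7.25. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID with distribution
$P$ given $\mathbf{P} = P$ and suppose that $P$ has CDF $F$ with derivative $f$
in a neighborhood of $x_p$, where $F(x_p) = p$, $0 < f(x_p) < \infty$, and $0 < p < 1$.
Define $Y_p^{(n)} = F_n^{-1}(p)$, where $F_n$ is the empirical CDF of $(X_1, \ldots, X_n)$.
Then

$$\sqrt{n}\left(Y_p^{(n)} - x_p\right) \xrightarrow{\mathcal{D}} N\left(0, \frac{p(1-p)}{f(x_p)^2}\right).$$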

The proof relies heavily on the following lemma. 

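Lemma 7.26. For each $z \in \mathbb{R}$ there exists a sequence of random variables
$\{A_n(z)\}_{n=1}^{\infty}$ such that $A_n(z) = o_P(1/\sqrt{n})$ and

$$\sqrt{n}\left(Y_p^{(n)} - x_p\right) \le z \text{ if and only if } \frac{\sqrt{n}}{f(x_p)}(p - F_n(x_p)) \le z + \sqrt{n}\frac{A_n(z)}{f(x_p)}. \qquad (7.27)$$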

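Proof. Define

$$A_n(z) = F_n\left(x_p + \frac{z}{\sqrt{n}}\right) - F_n(x_p) - \frac{z}{\sqrt{n}}f(x_p) = \frac{B_n}{n} - \frac{z}{\sqrt{n}}f(x_p) + U_n, \qquad (7.28)$$

where $B_n$ is the number of observations in the interval $(x_p, x_p + z/\sqrt{n}]$
and $U_n$ satisfies $\Pr(|U_n| \le 2/n) = 1$. In particular, $U_n = O_P(1/n)$. The
conditional distribution of $B_n$ given $\mathbf{P} = P$ is $Bin(n, \theta_n)$, where

$$\theta_n = F\left(x_p + \frac{z}{\sqrt{n}}\right) - F(x_p).$$

The characteristic function of $\sqrt{n}(A_n(z) - U_n)$ is

$$\mathrm{E}\exp\{i\sqrt{n}t(A_n(z) - U_n)\} = \exp\{-itzf(x_p)\}\,\mathrm{E}\exp\left(it\frac{B_n}{\sqrt{n}}\right) = \left(1 - \theta_n + \theta_n\exp\left\{\frac{it}{\sqrt{n}}\right\}\right)^n\exp\{-itzf(x_p)\}.$$

We can write

$$\theta_n = \frac{z}{\sqrt{n}}f(x_p) + o\left(\frac{1}{\sqrt{n}}\right).$$

It follows that

$$\left(1 - \theta_n + \theta_n\exp\left\{\frac{it}{\sqrt{n}}\right\}\right)^n = \left(1 + \frac{itzf(x_p)}{n} + o\left(\frac{1}{n}\right)\right)^n \to \exp\{itzf(x_p)\},$$

as $n \to \infty$. So

$$\lim_{n\to\infty}\mathrm{E}\exp\{i\sqrt{n}t(A_n(z) - U_n)\} = 1,$$

for all $t$. So, $\sqrt{n}[A_n(z) - U_n] \xrightarrow{\mathcal{D}} 0$ by the continuity theorem B.93, and
$\sqrt{n}[A_n(z) - U_n] \xrightarrow{P} 0$ by Theorem B.90. Hence $A_n(z) = U_n + o_P(1/\sqrt{n})$. Since
$U_n = O_P(1/n) = o_P(1/\sqrt{n})$, it follows that $A_n(z) = o_P(1/\sqrt{n})$.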
Finally, we prove (7.27). The following inequalities are all equivalent: 



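$$\sqrt{n}\left(Y_p^{(n)} - x_p\right) \le z,$$

$$Y_p^{(n)} \le x_p + \frac{z}{\sqrt{n}},$$

$$F_n\left(Y_p^{(n)}\right) \le F_n\left(x_p + \frac{z}{\sqrt{n}}\right),$$

$$p \le F_n\left(x_p + \frac{z}{\sqrt{n}}\right),$$

$$p \le A_n(z) + F_n(x_p) + \frac{z}{\sqrt{n}}f(x_p),$$

$$\frac{\sqrt{n}}{f(x_p)}(p - F_n(x_p)) \le z + \sqrt{n}\frac{A_n(z)}{f(x_p)}.$$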

The equivalence of the first and last of these is (7.27). □ 

Now, we are ready to prove Theorem 7.25. 
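Proof of Theorem 7.25. From Lemma 7.26, we know that

$$\Pr\left(\sqrt{n}\left(Y_p^{(n)} - x_p\right) \le z\right) = \Pr\left(\frac{\sqrt{n}}{f(x_p)}(p - F_n(x_p)) - \sqrt{n}\frac{A_n(z)}{f(x_p)} \le z\right).$$

We will prove that the right-hand side of this equation converges to the
necessary normal probability. We have that $F_n(x_p) = C_n/n + D_n$, where $C_n$
is the number of observations less than or equal to $x_p$ and $D_n = O_P(1/n)$.
Also, $A_n(z) = o_P(1/\sqrt{n})$, so

$$\frac{\sqrt{n}}{f(x_p)}(p - F_n(x_p)) - \sqrt{n}\frac{A_n(z)}{f(x_p)} = \frac{\sqrt{n}}{f(x_p)}\left(p - \frac{C_n}{n}\right) + o_P(1). \qquad (7.29)$$

The central limit theorem B.97 tells us that $\sqrt{n}(C_n/n - p) \xrightarrow{\mathcal{D}} N(0, p(1-p))$.
This, together with Lemma 7.19 applied to (7.29), completes the proof. □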
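As a quick numerical companion to Theorem 7.25, the limiting variance $p(1-p)/f(x_p)^2$ can be evaluated for the median of a normal and of a Cauchy distribution. The sketch below is illustrative only; the function name is hypothetical and the scale $\sigma = 1$ is an arbitrary choice.

```python
import math

def quantile_limit_variance(p, density_at_quantile):
    """Variance of the limiting normal law of sqrt(n) * (sample quantile - x_p)."""
    return p * (1.0 - p) / density_at_quantile ** 2

sigma = 1.0
# Normal(mu, sigma^2): the density at the median is 1 / (sigma * sqrt(2 pi)).
normal_median = quantile_limit_variance(0.5, 1.0 / (sigma * math.sqrt(2.0 * math.pi)))
# Cauchy with location mu and scale sigma: the density at the median is 1 / (sigma * pi).
cauchy_median = quantile_limit_variance(0.5, 1.0 / (sigma * math.pi))
print(round(normal_median, 3), round(cauchy_median, 3))
```

Dividing either value by $n$ gives the asymptotic variance of the sample median in the sense defined in Section 7.1.3.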

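Example 7.30. Suppose that $F$ has derivative

$$f(x) = \frac{1}{\pi\sigma}\left[1 + \left(\frac{x-\mu}{\sigma}\right)^2\right]^{-1},$$

where $\sigma > 0$ and $\mu$ are some numbers. If $p = 1/2$, $x_p = \mu$, and $f(x_p) = (\sigma\pi)^{-1}$. It
follows that the sample median has asymptotic distribution (given $\mathbf{P} = P$)

$$Y_{1/2}^{(n)} \sim AN\left(\mu, \frac{\pi^2\sigma^2}{4n}\right).$$

The asymptotic variance of $Y_{1/2}^{(n)}$ is $2.467\sigma^2/n$.

Example 7.31. Suppose that $F$ has derivative

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $\sigma > 0$ and $\mu$ are some numbers. If $p = 1/2$, $x_p = \mu$ and $f(x_p) = (\sigma\sqrt{2\pi})^{-1}$.
It follows that the sample median has asymptotic distribution (given $\mathbf{P} = P$)

$$Y_{1/2}^{(n)} \sim AN\left(\mu, \frac{\pi\sigma^2}{2n}\right).$$

The asymptotic variance of $Y_{1/2}^{(n)}$ is $1.571\sigma^2/n$.

For distributions that are bounded above or below, a different sort of
result holds for the $p = 1$ or $p = 0$ quantile. The following theorems are
examples.

Theorem 7.32. Suppose that $t \in \mathbb{R}$, $\alpha > 0$, and

$$\lim_{x\uparrow t}(t - x)^{-\alpha}[1 - F(x)] = c > 0.$$

Let $\{X_n\}_{n=1}^{\infty}$ be IID with CDF $F$ and let $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Then
$n^{1/\alpha}(t - X_{(n)})$ converges in distribution to a distribution with CDF
$G(x) = 1 - \exp(-cx^{\alpha})$, for $x > 0$.

Proof. Write

$$\Pr\left(n^{1/\alpha}[t - X_{(n)}] \ge x\right) = \Pr\left(X_{(n)} \le t - xn^{-1/\alpha}\right) = F\left(t - xn^{-1/\alpha}\right)^n.$$

Since

$$\lim_{n\to\infty} n\left[1 - F\left(t - xn^{-1/\alpha}\right)\right] = cx^{\alpha},$$

it follows that

$$\lim_{n\to\infty}\Pr\left(n^{1/\alpha}[t - X_{(n)}] \ge x\right) = \exp(-cx^{\alpha}).$$

Because this limit is continuous in $x$, the same limit holds with $>$ in place of $\ge$. □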


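Example 7.33. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID $U(0,\theta)$ given $\Theta = \theta$.
The CDF of $X_i$ (given $\Theta = \theta$) is $x/\theta$ for $0 < x < \theta$ and 1 for $x \ge \theta$.
With $t = \theta$ we get $\lim_{x\uparrow t}(t - x)^{-1}[1 - F(x)] = 1/\theta$. So Theorem 7.32 says that

$$n(\theta - X_{(n)}) \xrightarrow{\mathcal{D}} Exp(1/\theta).$$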
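For the uniform case the convergence can be inspected exactly, since $\Pr(X_{(n)} < y) = (y/\theta)^n$. The following sketch (function names are illustrative, not from the text) compares the finite-$n$ tail probability of $n(\theta - X_{(n)})$ with its exponential limit.

```python
import math

def exact_tail(n, theta, x):
    """Pr(n * (theta - max) > x) for n IID U(0, theta) observations."""
    if x >= n * theta:
        return 0.0
    return (1.0 - x / (n * theta)) ** n

def limit_tail(theta, x):
    """Tail 1 - G(x) = exp(-x / theta) of the limiting distribution."""
    return math.exp(-x / theta)

for n in (10, 100, 10000):
    print(n, exact_tail(n, 2.0, 3.0), limit_tail(2.0, 3.0))
```

The gap between the two columns shrinks roughly like $1/n$.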

A similar theorem can be proven for distributions bounded below. 

Proposition 7.34. Suppose that $t \in \mathbb{R}$, $\alpha > 0$, and $\lim_{x\downarrow t}(x - t)^{-\alpha}F(x) = c > 0$.
Let $\{X_n\}_{n=1}^{\infty}$ be IID with CDF $F$ and let $X_{(1)} = \min\{X_1, \ldots, X_n\}$.
Then $n^{1/\alpha}(X_{(1)} - t)$ converges in distribution to a distribution with CDF
$G(x) = 1 - \exp(-cx^{\alpha})$, for $x > 0$.

Krem (1963) proves that extreme order statistics (like the min and max) 
are asymptotically independent of the central order statistics (like the quan- 
tiles). 

7.2.2 Several Quantiles 

We can prove a theorem similar to Theorem 7.25 for several sample quan- 
tiles simultaneously. 

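Theorem 7.35. Let $0 < p_1 < \cdots < p_k < 1$. Suppose that $\{X_n\}_{n=1}^{\infty}$ are
conditionally IID with distribution $P$ given $\mathbf{P} = P$ and suppose that $P$ has
CDF $F$ with derivative $f$ in a neighborhood of each $x_{p_i}$ ($i = 1, \ldots, k$), where
$F(x_{p_i}) = p_i$, $0 < f(x_{p_i}) < \infty$, and $0 < p_i < 1$. Define $Y_{p_i}^{(n)} = F_n^{-1}(p_i)$,
where $F_n$ is the empirical CDF of $(X_1, \ldots, X_n)$. Then

$$\sqrt{n}\begin{pmatrix} Y_{p_1}^{(n)} - x_{p_1} \\ \vdots \\ Y_{p_k}^{(n)} - x_{p_k}\end{pmatrix} \xrightarrow{\mathcal{D}} N_k(0, V),$$

where $V = ((\psi_{i,j}))$ and $\psi_{i,j} = [p_{\min\{i,j\}} - p_ip_j]/[f(x_{p_i})f(x_{p_j})]$.

Proof. Define

$$Z_{i,n} = \sqrt{n}\left(Y_{p_i}^{(n)} - x_{p_i}\right), \qquad W_{i,n} = \frac{\sqrt{n}}{f(x_{p_i})}\left(p_i - F_n(x_{p_i})\right).$$

Let $z_1, \ldots, z_k$ be real numbers and let $A_{i,n}(z_i)$ equal (7.28) with $p = p_i$ for
$i = 1, \ldots, k$. Then,

$$Z_{i,n} \le z_i \text{ if and only if } W_{i,n} - \sqrt{n}\frac{A_{i,n}(z_i)}{f(x_{p_i})} \le z_i.$$

Since $(A_{1,n}(z_1), \ldots, A_{k,n}(z_k)) = o_P(1/\sqrt{n})$, it follows from Lemmas 7.19
and 7.26 that the two vectors $Z_n$ and $W_n$ converge in distribution to the
same thing if either one of them converges. It is easier to find what $W_n$
converges to, so that is what we will do.
We can write

$$F_n(x_{p_i}) = \frac{1}{n}\sum_{j=1}^{i} M_j + O_P\left(\frac{1}{n}\right),$$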





where $M_j$ is the number of observed values in the interval $(x_{p_{j-1}}, x_{p_j}]$.
For convenience, set $p_0 = 0$, $p_{k+1} = 1$, $x_{p_0} = -\infty$, and $x_{p_{k+1}} = \infty$. It is
clear that the conditional distribution of $(M_1, \ldots, M_{k+1})$ given $\mathbf{P} = P$ is
multinomial, $Mult(n; q_1, \ldots, q_{k+1})$, where $q_i = p_i - p_{i-1}$ for $i = 1, \ldots, k+1$.
Set $G_n = (M_1, \ldots, M_{k+1})^{\top}$ and $q = (q_1, \ldots, q_{k+1})^{\top}$. The multivariate
central limit theorem B.99 implies that, conditional on $\mathbf{P} = P$,

$$\sqrt{n}\left(\frac{1}{n}G_n - q\right) \xrightarrow{\mathcal{D}} N_{k+1}(0, \Sigma)$$

(a multivariate normal distribution), where

$$\Sigma = \begin{pmatrix} q_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & q_{k+1}\end{pmatrix} - \begin{pmatrix} q_1 \\ \vdots \\ q_{k+1}\end{pmatrix}(q_1, \ldots, q_{k+1}).$$



Next, note that 



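$$\begin{pmatrix} F_n(x_{p_1}) \\ \vdots \\ F_n(x_{p_k})\end{pmatrix} = \frac{1}{n}AG_n + o_P\left(\frac{1}{\sqrt{n}}\right), \qquad (7.36)$$

where $A$ is the $k \times (k+1)$ matrix with 1 on and below the diagonal and 0
above the diagonal:

$$A = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 1 & 1 & 1 & \cdots & 1 & 0\end{pmatrix}.$$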


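Call the vector on the left of (7.36) $R$. Then the conditional mean of $R$
given $\mathbf{P} = P$ is $p = Aq = (p_1, \ldots, p_k)^{\top}$, and $\sqrt{n}(p - R) \xrightarrow{\mathcal{D}} N_k(0, A\Sigma A^{\top})$.
All that remains is to compute $A\Sigma A^{\top}$. This can be seen to equal

$$A\Sigma A^{\top} = \begin{pmatrix} p_1 & p_1 & \cdots & p_1 \\ p_1 & p_2 & \cdots & p_2 \\ \vdots & \vdots & \ddots & \vdots \\ p_1 & p_2 & \cdots & p_k\end{pmatrix} - \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_k\end{pmatrix}(p_1, \ldots, p_k).$$

□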
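The matrix identity at the end of the proof of Theorem 7.35 is easy to verify numerically. The sketch below uses plain lists; the helper name `cdf_vector_covariance` is hypothetical. It builds $\Sigma$ and $A$ from a vector of probabilities $p_1 < \cdots < p_k$ and returns $A\Sigma A^{\top}$, whose $(i,j)$ entry should be $p_{\min\{i,j\}} - p_ip_j$.

```python
def cdf_vector_covariance(p):
    """Compute A Sigma A^T for cell probabilities q_i = p_i - p_{i-1}."""
    k = len(p)
    # Multinomial cell probabilities, with p_0 = 0 and p_{k+1} = 1.
    q = [p[0]] + [p[i] - p[i - 1] for i in range(1, k)] + [1.0 - p[-1]]
    # Sigma = diag(q) - q q^T, of size (k+1) x (k+1).
    sigma = [[(q[i] if i == j else 0.0) - q[i] * q[j] for j in range(k + 1)]
             for i in range(k + 1)]
    # A is k x (k+1) with 1 on and below the diagonal, 0 above.
    a = [[1.0 if j <= i else 0.0 for j in range(k + 1)] for i in range(k)]
    a_sigma = [[sum(a[i][l] * sigma[l][j] for l in range(k + 1))
                for j in range(k + 1)] for i in range(k)]
    return [[sum(a_sigma[i][l] * a[j][l] for l in range(k + 1))
             for j in range(k)] for i in range(k)]

probs = [0.1, 0.5, 0.9]
for row in cdf_vector_covariance(probs):
    print([round(entry, 4) for entry in row])
```

Dividing entry $(i,j)$ by $f(x_{p_i})f(x_{p_j})$ then gives the covariance of the limiting distribution of the sample quantiles.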



Since the definition of $F_n$ is arbitrary between observed values, the
asymptotic distribution in Theorem 7.35 applies to every vector of random
variables whose $i$th coordinate is between $X_{(j-1)}$ and $X_{(j)}$ when
$(j-1)/n < p_i \le j/n$.

An analogue to Theorem 7.32 and Proposition 7.34 can be proven for the joint
distribution of the smallest and largest order statistics.




Proposition 7.37. Suppose that $t_1, t_2 \in \mathbb{R}$, $\alpha_1, \alpha_2 > 0$, and

$$\lim_{x\downarrow t_1}(x - t_1)^{-\alpha_1}F(x) = c_1 > 0,$$

$$\lim_{x\uparrow t_2}(t_2 - x)^{-\alpha_2}[1 - F(x)] = c_2 > 0.$$

Let $\{X_n\}_{n=1}^{\infty}$ be IID with CDF $F$, and let $X_{(1)} = \min\{X_1, \ldots, X_n\}$ and
$X_{(n)} = \max\{X_1, \ldots, X_n\}$. Then the asymptotic joint CDF of $n^{1/\alpha_1}(X_{(1)} - t_1)$
and $n^{1/\alpha_2}(t_2 - X_{(n)})$ is $(1 - \exp(-c_1x_1^{\alpha_1}))(1 - \exp(-c_2x_2^{\alpha_2}))$.



7.2.3 Linear Combinations of Quantiles* 

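A linear combination of sample quantiles is called an L-estimator. Suppose
that $f$ is the derivative of the conditional CDF of the $X_i$ given $\mathbf{P} = P$ and
that $f$ is symmetric about $g(\theta)$. Here, we suppose that $P = \Theta^{-1}(\theta)$ and
$f(x) = h(x - g(\theta))$. Let $Z_i = X_i - g(\theta)$. Then the conditional density of the
$Z_i$ is $h(\cdot)$. If we choose to sample quantiles symmetric about the median, say
$p$ and $1-p$, then $x_p = 2g(\theta) - x_{1-p}$ by symmetry. Let $z_p = x_p - g(\theta)$ for all $p$
so that $z_p = -z_{1-p}$. Let $W_p^{(n)} = Y_p^{(n)} - g(\theta)$, so that $W_p^{(n)} - z_p = Y_p^{(n)} - x_p$.
Theorem 7.35 then gives, for $p < 1/2$,

$$\sqrt{n}\begin{pmatrix} W_p^{(n)} - z_p \\ W_{1/2}^{(n)} \\ W_{1-p}^{(n)} - z_{1-p}\end{pmatrix} \xrightarrow{\mathcal{D}} N_3\left(0, \begin{pmatrix} \frac{p(1-p)}{h(z_p)^2} & \frac{p}{2h(z_p)h(0)} & \frac{p^2}{h(z_p)^2} \\ \frac{p}{2h(z_p)h(0)} & \frac{1}{4h(0)^2} & \frac{p}{2h(z_p)h(0)} \\ \frac{p^2}{h(z_p)^2} & \frac{p}{2h(z_p)h(0)} & \frac{(1-p)p}{h(z_p)^2}\end{pmatrix}\right).$$

If the goal is to estimate $g(\Theta)$, it might be good if the asymptotic mean were
$g(\theta)$ given $\Theta = \theta$. The asymptotic conditional mean of
$a_1Y_p^{(n)} + a_2Y_{1/2}^{(n)} + a_3Y_{1-p}^{(n)}$ is $(a_1 + a_2 + a_3)g(\theta) + (a_3 - a_1)z_{1-p}$, since $h$ is symmetric around
0. For $p < 1/2$, this will equal $g(\theta)$ for all $\theta$ if and only if $a_3 = a_1$ and
$a_1 + a_2 + a_3 = 1$. Hence our estimator must be

$$aY_p^{(n)} + (1 - 2a)Y_{1/2}^{(n)} + aY_{1-p}^{(n)}, \qquad (7.38)$$

for some $a$.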

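Example 7.39 (Continuation of Example 7.30; see page 406). As an example,
consider the case of Cauchy distributions with a location parameter $\theta$. Then
$Z_i = X_i - \theta$, $h(x) = (\pi[1 + x^2])^{-1}$, and $z_{1-p} = \tan[\pi(1/2 - p)]$. So, for example,
if $p = 1/3$, then $z_p = -1/\sqrt{3}$, $z_{1-p} = 1/\sqrt{3}$, and $h(z_p) = 3/(4\pi) = h(z_{1-p})$. The
asymptotic covariance matrix of the three sample quantiles is then

$$\Sigma = \frac{\pi^2}{n}\begin{pmatrix} \frac{32}{81} & \frac{2}{9} & \frac{16}{81} \\ \frac{2}{9} & \frac{1}{4} & \frac{2}{9} \\ \frac{16}{81} & \frac{2}{9} & \frac{32}{81}\end{pmatrix}.$$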




This section may be skipped without interrupting the flow of ideas. 






The asymptotic variance of the estimator in (7.38) is

$$(a, 1-2a, a)\,\Sigma\begin{pmatrix} a \\ 1-2a \\ a\end{pmatrix} = \frac{\pi^2}{n}\left(\frac{1}{4} - \frac{a}{9} + \frac{11a^2}{27}\right).$$

This variance is minimized at $a = 3/22$. The minimum asymptotic variance is
$8\pi^2/(33n) = 2.39/n$, which is not much better than we got with the median alone
($2.467/n$).

Perhaps improvement can be made in Example 7.30 by altering p. The 
general method for doing this is illustrated by continuing the example. 

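Example 7.40 (Continuation of Example 7.39; see page 410). For general $p < 1/2$,

$$h(z_p) = \frac{1}{\pi}\cos^2\left(\pi\left(p - \frac{1}{2}\right)\right) = \frac{1}{\pi}c(p),$$

say, where $c(1/2) = 1$. The asymptotic covariance matrix of the three quantiles is

$$\Sigma = \frac{\pi^2}{n}\begin{pmatrix} \frac{p(1-p)}{c(p)^2} & \frac{p}{2c(p)} & \frac{p^2}{c(p)^2} \\ \frac{p}{2c(p)} & \frac{1}{4} & \frac{p}{2c(p)} \\ \frac{p^2}{c(p)^2} & \frac{p}{2c(p)} & \frac{p(1-p)}{c(p)^2}\end{pmatrix}.$$

The asymptotic variance of the estimator in (7.38) is

$$(a, 1-2a, a)\,\Sigma\begin{pmatrix} a \\ 1-2a \\ a\end{pmatrix} = \frac{\pi^2}{n}\left(\frac{1}{4} - 2a\left(\frac{1}{2} - \frac{p}{c(p)}\right) + a^2\left(1 + \frac{2p}{c(p)^2} - \frac{4p}{c(p)}\right)\right).$$

The variance is minimized at

$$a = a^*(p) = \frac{c(p)^2 - 2pc(p)}{2[c(p)^2 - 4pc(p) + 2p]},$$

and the minimum variance is

$$\frac{\pi^2}{4n}\left[1 - \frac{(c(p) - 2p)^2}{c(p)^2 - 4pc(p) + 2p}\right] = \frac{1}{n}s(p).$$

We can numerically minimize $s(p)$ and find the minima occur at $p = 0.42085$ and
at $p = 0.07915$. The minimum $s(p)$ is 2.302, which is only slightly better than
using $p = 1/3$.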
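The numerical minimization mentioned in Example 7.40 can be reproduced with a simple grid search. This is a sketch under the reading $c(p) = \cos^2(\pi(p - 1/2))$ and $s(p) = (\pi^2/4)[1 - (c(p) - 2p)^2/(c(p)^2 - 4pc(p) + 2p)]$; the function names and the grid bounds are illustrative choices, and the endpoint $p = 1/2$ is avoided because the denominator vanishes there.

```python
import math

def c(p):
    return math.cos(math.pi * (p - 0.5)) ** 2

def optimal_weight(p):
    """The weight a*(p) that minimizes the asymptotic variance for a given p."""
    cp = c(p)
    return (cp ** 2 - 2 * p * cp) / (2 * (cp ** 2 - 4 * p * cp + 2 * p))

def s(p):
    """n times the minimum asymptotic variance when quantiles p, 1/2, 1-p are used."""
    cp = c(p)
    return (math.pi ** 2 / 4) * (1 - (cp - 2 * p) ** 2 / (cp ** 2 - 4 * p * cp + 2 * p))

def grid_minimum(lo, hi, steps=20000):
    """Crude grid search for the minimizer of s on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    best = min(grid, key=s)
    return best, s(best)

print(optimal_weight(1 / 3), s(1 / 3))  # weight and variance constant at p = 1/3
print(grid_minimum(0.30, 0.49))         # upper minimum
print(grid_minimum(0.01, 0.20))         # lower minimum
```

A grid search is used here only for transparency; any one-dimensional optimizer would serve.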

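Example 7.41. Suppose that the distributions are double exponential (also
known as Laplace distributions). That is, $h(x) = \exp(-|x|)/2$. Then

$$z_p = \begin{cases} \log 2p & \text{if } p \le \frac{1}{2}, \\ -\log 2(1-p) & \text{if } p > \frac{1}{2},\end{cases} \qquad h(z_p) = \begin{cases} p & \text{if } p \le \frac{1}{2}, \\ 1-p & \text{if } p > \frac{1}{2}.\end{cases}$$

The asymptotic covariance matrix of three symmetric sample quantiles is

$$\Sigma = \begin{pmatrix} \frac{1-p}{p} & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & \frac{1-p}{p}\end{pmatrix}.$$

The asymptotic variance of the estimator in (7.38) is

$$(a, 1-2a, a)\,\Sigma\begin{pmatrix} a \\ 1-2a \\ a\end{pmatrix} = a^2\left(\frac{2}{p} - 4\right) + 1,$$

which has a minimum at $a = 0$ and the minimum value is 1. This means that, no
matter what $p$ is, it is better to use just the median.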



7.3 Large Sample Estimation 

7.3.1 Some Principles of Large Sample Estimation 

One would hope that if a large sample were available, then better knowledge
of $\mathbf{P}$ would be available, and we would be close to the situation of having
independent observations. Since predictive inference is usually not the goal
for classical statistics, the issue becomes how well we have estimated $\Theta$.
There is the belief that an estimator ought to get $\Theta$ correct eventually.
That is, the estimator should be consistent (see Definition 7.9). If more 
than one estimator is consistent, then one might ask, "Which is better?" 
Without a loss function or some indication of how we plan to use the 
estimator, this question is not interesting. There are, nonetheless, answers 
to the question. 

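Let $\Theta$ be $k$-dimensional, and suppose that the FI regularity conditions
(see Definition 2.78) hold. Then the Fisher information matrix (based on
a single observation) $\mathcal{I}_{X_1}(\theta)$ can be calculated. Suppose also that an
estimator $\hat{\Theta}_n$ of $\Theta$ converges in distribution, say $\sqrt{n}(\hat{\Theta}_n - \theta) \xrightarrow{\mathcal{D}} N_k(0, V_\theta)$
given $\Theta = \theta$. If we wish to estimate $g(\Theta)$, with $g$ having continuous first
partial derivatives, then the delta method tells us that
$\sqrt{n}[g(\hat{\Theta}_n) - g(\theta)] \xrightarrow{\mathcal{D}} N(0, c_\theta^{\top}V_\theta c_\theta)$, where

$$c_\theta = \left(\frac{\partial}{\partial\theta_1}g(\theta), \ldots, \frac{\partial}{\partial\theta_k}g(\theta)\right)^{\top}. \qquad (7.42)$$

Corollary 5.23 says that the smallest possible variance for an unbiased
estimator of $g(\Theta)$ is $c_\theta^{\top}\mathcal{I}_{X_1}(\theta)^{-1}c_\theta/n$. Since $g(\hat{\Theta}_n)$ is asymptotically unbiased,
the ratio of these two variances might be used as a measure of how good a
consistent estimator is.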

Definition 7.43. If $G_n$ is an estimator of $g(\Theta)$ for each $n$ and

$$\sqrt{n}(G_n - g(\theta)) \xrightarrow{\mathcal{D}} N(0, v_\theta), \text{ for all } \theta,$$

then the ratio $c_\theta^{\top}\mathcal{I}_{X_1}(\theta)^{-1}c_\theta/v_\theta$ is called the asymptotic efficiency of $G_n$ at
$\theta$. If the ratio is 1, the sequence $\{G_n\}_{n=1}^{\infty}$ is called asymptotically efficient.

Suppose that $\{G_n\}_{n=1}^{\infty}$ and $\{G_n'\}_{n=1}^{\infty}$ are sequences of estimators of $g(\Theta)$,
and we have a specific criterion that we require of our estimator, such as
variance equal to $\epsilon$. Suppose that $G_{n_0}$ and $G'_{n_0'}$ satisfy this criterion. Then
the relative efficiency of $\{G_n\}_{n=1}^{\infty}$ to $\{G_n'\}_{n=1}^{\infty}$ for the specific criterion is
$n_0'/n_0$. Suppose that the criterion is allowed to change in such a way that
the sample sizes required to satisfy it go to $\infty$, for example, variance equal
to $\epsilon$ with $\epsilon$ going to 0. If the ratio $n_0'/n_0$ converges to a value $r$, then $r$ is
called the asymptotic relative efficiency (ARE)$^{11}$ of $\{G_n\}_{n=1}^{\infty}$ to $\{G_n'\}_{n=1}^{\infty}$.

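Example 7.44. Let $\{X_n\}_{n=1}^{\infty}$ be conditionally IID with $N(\mu, \sigma^2)$ distribution
given $\Theta = (\mu, \sigma)$. Let $g(\theta) = \mu$. Let $G_n = \overline{X}_n$, the sample average, and let $G_n'$
be the sample median. Let our specific criterion be that the asymptotic variance
of the estimator must equal $\epsilon$. The central limit theorem B.97 says that
$\sqrt{n}(G_n - \mu) \xrightarrow{\mathcal{D}} N(0, \sigma^2)$, and Example 7.31 on page 407 shows that
$\sqrt{n}(G_n' - \mu) \xrightarrow{\mathcal{D}} N(0, \sigma^2\pi/2)$. The criterion therefore requires $n_0 = \sigma^2/\epsilon$ and
$n_0' = \pi\sigma^2/(2\epsilon)$, so the relative efficiency of the sample median to the sample
mean is $n_0/n_0' = 2/\pi = 0.637$ for all $\epsilon$. If
we let $\epsilon \to 0$, the ARE of the sample median to the sample mean is 0.637 as well.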

The idea of ARE is to compare the sizes of samples needed to make 
comparable inferences from the two sequences. 

Example 7.45. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID $U(0,\theta)$ given $\Theta = \theta$.
The MLE is $\hat{\Theta}_n = \max X_i$. Another estimator is twice the sample average,
$2\overline{X}_n$. Suppose that our criterion is that the actual variance of the estimator
must equal $\theta^2\epsilon$. Since $\hat{\Theta}_n/\theta$ has $Beta(n,1)$ distribution, the variance of $\hat{\Theta}_n$ is
$\theta^2n/[(n+1)^2(n+2)]$. The variance of $2\overline{X}_n$ is $\theta^2/(3n)$. Let $n_0$ be the sample size
at which $\hat{\Theta}_n$ has variance $\theta^2\epsilon$, and let $n_0'$ be the sample size at which $2\overline{X}_n$ has
variance $\theta^2\epsilon$. It is easy to see that we must have $n_0' = (n_0+1)^2(n_0+2)/(3n_0)$.
So, $n_0'/n_0 = (n_0+1)^2(n_0+2)/(3n_0^2)$ for all $\epsilon$. As $\epsilon \to 0$, $n_0 \to \infty$ and the ratio $n_0'/n_0$
goes to $\infty$. That is, the ARE of $\hat{\Theta}_n$ to $2\overline{X}_n$ is $\infty$.
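The sample-size matching in Example 7.45 can be traced with exact rational arithmetic. The sketch below is illustrative (function names are not from the text) and works with variances divided by $\theta^2$; note that the matching size $n_0'$ is generally not an integer, so it is treated as a real-valued quantity.

```python
from fractions import Fraction

def var_max(n):
    """Variance of the MLE max(X_i), divided by theta^2."""
    return Fraction(n, (n + 1) ** 2 * (n + 2))

def var_twice_mean(n):
    """Variance of twice the sample average, divided by theta^2."""
    return 1 / (3 * Fraction(n))

def matching_size(n0):
    """Sample size n0' at which 2 * mean has the variance the MLE has at n0."""
    return Fraction((n0 + 1) ** 2 * (n0 + 2), 3 * n0)

for n0 in (10, 100, 1000):
    n0_prime = matching_size(n0)
    # Both estimators have the same variance at their respective sample sizes.
    assert var_twice_mean(n0_prime) == var_max(n0)
    print(n0, float(n0_prime), float(n0_prime / n0))
```

The last column grows roughly like $n_0/3$, which is why the ARE is infinite.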

Example 7.46. Let $H$ and $H'$ be nondegenerate distributions that have some
common scale feature (like the same finite standard deviation or the same
interquartile range). Suppose that $a_n(G_n - g(\theta)) \xrightarrow{\mathcal{D}} H$ and $b_n(G_n' - g(\theta)) \xrightarrow{\mathcal{D}} H'$.
Suppose also that $\lim_{n\to\infty} a_n/b_n = r$. Then $r$ is the relative rate of convergence
of $\{G_n\}_{n=1}^{\infty}$ to $\{G_n'\}_{n=1}^{\infty}$.$^{12}$ Note that when $H$ and $H'$ are both normal and $a_n$
and $b_n$ are both $O(\sqrt{n})$, the relative rate of convergence is the square root of the
ARE for asymptotic variance.

In Section 7.3.2, we show that the class of maximum likelihood estimators 
(see Section 5.1.3) are efficient under quite general conditions. At first, it 



11 This definition of ARE is taken from Serfling (1980, pp. 50-52). Serfling's 
definition actually applies to more types of inference than estimators, but we will 
not pursue that generality here. 

12 Solve Problem 22 on page 470 to show that the relative rate of convergence 
is uniquely defined. Relative rate of convergence is not an example of a criterion 
for ARE, but it has a similar nature. 






might seem that achieving asymptotic efficiency of 1 would be the best 
possible, but sometimes efficiency greater than 1 is possible. 13 

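Example 7.47.$^{14}$ Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID $N(\theta, 1)$ given
$\Theta = \theta$. We already know that $\mathcal{I}_{X_1}(\theta) = 1$ and $\sqrt{n}(\overline{X}_n - \theta) \sim N(0,1)$, so $\overline{X}_n$
is asymptotically efficient. (Actually, it is efficient in finite samples.) Let $\theta_0$ be
arbitrary, and define a new estimator of $\Theta$:

$$\delta_n = \begin{cases} \overline{X}_n & \text{if } |\overline{X}_n - \theta_0| \ge n^{-1/4}, \\ \theta_0 + a(\overline{X}_n - \theta_0) & \text{if } |\overline{X}_n - \theta_0| < n^{-1/4},\end{cases}$$

where $0 < a < 1$. This is like using $\overline{X}_n$ when $\overline{X}_n$ is not close to $\theta_0$, but using the
posterior mean of $\Theta$ from a prior centered at $\theta_0$ when $\overline{X}_n$ is close to $\theta_0$.
We will now calculate the efficiency of $\delta_n$. Suppose that $\theta \neq \theta_0$. Then

$$\sqrt{n}|\overline{X}_n - \delta_n| = \sqrt{n}(1 - a)|\overline{X}_n - \theta_0|\,I_{[0, n^{-1/4})}(|\overline{X}_n - \theta_0|).$$

Hence, for $\epsilon > 0$, $P_\theta(\sqrt{n}|\overline{X}_n - \delta_n| > \epsilon)$ is at most

$$P_\theta(|\overline{X}_n - \theta_0| < n^{-1/4}) = P_\theta(\theta_0 - n^{-1/4} < \overline{X}_n < \theta_0 + n^{-1/4}) = P_\theta\left((\theta_0 - \theta)\sqrt{n} - n^{1/4} < Z < (\theta_0 - \theta)\sqrt{n} + n^{1/4}\right),$$

where $Z = \sqrt{n}(\overline{X}_n - \theta)$ has $N(0,1)$ distribution given $\Theta = \theta$. This last probability
goes to 0 as $n$ goes to infinity because both of the endpoints either go to $+\infty$ or
$-\infty$. Hence, if $\theta \neq \theta_0$, $\delta_n = \overline{X}_n + o_P(1/\sqrt{n})$.
Now, suppose that $\theta = \theta_0$. Then

$$\sqrt{n}|a(\overline{X}_n - \theta_0) + \theta_0 - \delta_n| = (1 - a)\sqrt{n}|\overline{X}_n - \theta_0|\,I_{[n^{-1/4}, \infty)}(|\overline{X}_n - \theta_0|).$$

Hence, for $\epsilon > 0$, $P_{\theta_0}(\sqrt{n}|a(\overline{X}_n - \theta_0) + \theta_0 - \delta_n| > \epsilon)$ is at most

$$P_{\theta_0}(|\overline{X}_n - \theta_0| \ge n^{-1/4}) = P_{\theta_0}(\sqrt{n}|\overline{X}_n - \theta_0| \ge n^{1/4}) \to 0,$$

as $n \to \infty$. So, if $\theta = \theta_0$, $\delta_n = \theta_0 + a(\overline{X}_n - \theta_0) + o_P(1/\sqrt{n})$. It follows that
$\sqrt{n}(\delta_n - \theta) \xrightarrow{\mathcal{D}} N(0, v_\theta)$, where $v_{\theta_0} = a^2$ and $v_\theta = 1$ for all other $\theta$. Efficiency is
$1/a^2 > 1$ at $\theta = \theta_0$.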
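The behavior of the estimator in Example 7.47 can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the text: it draws $\overline{X}_n$ directly from $N(\theta, 1/n)$, uses the threshold $n^{-1/4}$, fixes $a = 0.5$ and $\theta_0 = 0$, and reports $n$ times the Monte Carlo mean squared error as a rough stand-in for $v_\theta$.

```python
import math
import random

def hodges(xbar, n, theta0, a):
    """Shrink toward theta0 only when the sample mean is within n^(-1/4) of it."""
    if abs(xbar - theta0) >= n ** -0.25:
        return xbar
    return theta0 + a * (xbar - theta0)

def scaled_mse(theta, n, theta0=0.0, a=0.5, reps=20000, seed=0):
    """Monte Carlo estimate of n * E[(delta_n - theta)^2]."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)  # the sample mean of n N(theta, 1) draws is N(theta, 1/n)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)
        total += (hodges(xbar, n, theta0, a) - theta) ** 2
    return n * total / reps

print(scaled_mse(0.0, 10000))  # near a^2 = 0.25 at theta = theta0
print(scaled_mse(1.0, 10000))  # near 1 away from theta0
```

For values of $\theta$ close to but different from $\theta_0$, finite-sample behavior can be much worse than either limit suggests, which this sketch does not explore.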

The phenomenon of Example 7.47 is called superefficiency. It is easy to
see how one could arrange for an estimator to be superefficient at several
different possible $\theta$. LeCam (1953) proved that, under conditions a little
stronger than the FI regularity conditions, superefficiency can only occur
at a set of zero Lebesgue measure.



13 When an estimator is efficient, or when two estimators have ARE equal to 1, 
more detailed comparisons are often made in a study of second- order efficiency. 
We will not study second-order efficiency in this text. 

14 This example is due to Hodges; see LeCam (1953). 






7.3.2 Maximum Likelihood Estimators 

In Section 5.1.3, we defined maximum likelihood estimators (MLE) to be
estimators that maximize the likelihood function $L(\theta) = f_{X|\Theta}(x|\theta)$. That
is, an MLE of $\Theta$ after observing $X = x$ is any $\theta$ at which $L(\theta)$ achieves
its maximum, if there are any such $\theta$. In this section, we prove some large
sample properties of these estimators.

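Theorem 7.48.$^{15}$ Assume that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID given $\Theta = \theta$,
each with density $f_{X_1|\Theta}(x|\theta)$. Then, for each $\theta_0$ and each $\theta \neq \theta_0$,

$$\lim_{n\to\infty}P_{\theta_0}\left[\prod_{i=1}^{n} f_{X_1|\Theta}(X_i|\theta_0) > \prod_{i=1}^{n} f_{X_1|\Theta}(X_i|\theta)\right] = 1.$$

Proof. With $P_{\theta_0}$ measure 1, $\prod_{i=1}^{n} f_{X_1|\Theta}(x_i|\theta_0) > \prod_{i=1}^{n} f_{X_1|\Theta}(x_i|\theta)$ if and
only if

$$R(x) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{f_{X_1|\Theta}(x_i|\theta)}{f_{X_1|\Theta}(x_i|\theta_0)} < 0.$$

By the weak law of large numbers B.95, under $P_{\theta_0}$,

$$R(X) \xrightarrow{P} \mathrm{E}_{\theta_0}\log\frac{f_{X_1|\Theta}(X_i|\theta)}{f_{X_1|\Theta}(X_i|\theta_0)} = -\mathcal{I}_{X_1}(\theta_0;\theta),$$

where $\mathcal{I}_{X_1}(\theta_0;\theta)$ is the Kullback-Leibler information from Definition 2.89.
By Proposition 2.92, we know that $-\mathcal{I}_{X_1}(\theta_0;\theta) < 0$ if $\theta \neq \theta_0$. It follows
that $\lim_{n\to\infty}P_{\theta_0}(R(X) < 0) = 1$. □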
Theorem 7.48 suggests that the MLE should be consistent, since the $P_{\theta_0}$
probability goes to 1 that the likelihood function is higher at $\theta_0$ than at
some other parameter value. Some further conditions are required to prove
consistency. Wald (1949) proved almost sure convergence of the MLE under
the assumption that the likelihood function was continuous. Theorem 7.49
and Lemma 7.54 are very much like Wald's result.

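Theorem 7.49. Let $\{X_n\}_{n=1}^{\infty}$ be conditionally IID given $\Theta = \theta$ with density
$f_{X_1|\Theta}(x|\theta)$ with respect to a measure $\nu$ on a space $(\mathcal{X}^1, \mathcal{B}^1)$. Fix $\theta_0 \in \Omega$,
and define, for each $M \subseteq \Omega$ and $x \in \mathcal{X}^1$,

$$Z(M, x) = \inf_{\psi\in M}\log\frac{f_{X_1|\Theta}(x|\theta_0)}{f_{X_1|\Theta}(x|\psi)}.$$

Assume that for each $\theta \neq \theta_0$ there is an open set $N_\theta$ such that $\theta \in N_\theta$
and $\mathrm{E}_{\theta_0}Z(N_\theta, X_i) > 0$. If $\Omega$ is not compact, assume further that there is
a compact $C \subseteq \Omega$ such that $\theta_0 \in C$ and $\mathrm{E}_{\theta_0}Z(\Omega\setminus C, X_i) > 0$. Then,
$\lim_{n\to\infty}\hat{\Theta}_n = \theta_0$, a.s. $[P_{\theta_0}]$.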



15 This theorem can be strengthened to an almost sure result. See Problem 28 
on page 471. 






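Proof. If $\Omega$ is compact, let $C = \Omega$. It suffices to prove that for every $\epsilon > 0$,

$$P_{\theta_0}\left(\limsup_{n\to\infty}\|\hat{\Theta}_n - \theta_0\| > \epsilon\right) = 0. \qquad (7.50)$$

Let $\epsilon > 0$ and let $N_0$ be the open ball of radius $\epsilon$ around $\theta_0$. Since $C\setminus N_0$
is a compact set, and $\{N_\theta : \theta \in C\setminus N_0\}$ is an open cover, we may extract
a finite subcover, $N_{\theta_1}, \ldots, N_{\theta_\ell}$. Rename these sets and $C^c$ to $\Omega_1, \ldots, \Omega_m$,
so that $\Omega = N_0 \cup (\cup_{j=1}^{m}\Omega_j)$, and $\mathrm{E}_{\theta_0}Z(\Omega_j, X_i) > 0$.

Let $\mathcal{X}^{\infty}$ be the infinite product space of copies of $\mathcal{X}^1$. Let $x \in \mathcal{X}^{\infty}$ denote
a generic sequence of possible data values. Let $\mathrm{E}_{\theta_0}Z(\Omega_j, X_i) = c_j$. By the
strong law of large numbers 1.63, $\sum_{i=1}^{n} Z(\Omega_j, X_i)/n \to c_j$, a.s. $[P_{\theta_0}]$. Let
$B_j \subseteq \mathcal{X}^{\infty}$ be the set of data sequences such that convergence holds, and let
$B = \cap_{j=1}^{m}B_j$. Then $P_{\theta_0}(B) = 1$ and $\sum_{i=1}^{n} Z(\Omega_j, x_i)/n \to c_j > 0$ for each
$x = (x_1, x_2, \ldots) \in B$. Now, notice that

$$\left\{x : \limsup_{n\to\infty}\|\hat{\Theta}_n(x_1, \ldots, x_n) - \theta_0\| > \epsilon\right\}$$
$$\subseteq \bigcup_{j=1}^{m}\left\{x : \hat{\Theta}_n(x_1, \ldots, x_n) \in \Omega_j, \text{ infinitely often}\right\}$$
$$\subseteq \bigcup_{j=1}^{m}\left\{x : \inf_{\psi\in\Omega_j}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f_{X_1|\Theta}(x_i|\theta_0)}{f_{X_1|\Theta}(x_i|\psi)} \le 0, \text{ infinitely often}\right\}$$
$$\subseteq \bigcup_{j=1}^{m}\left\{x : \frac{1}{n}\sum_{i=1}^{n}Z(\Omega_j, x_i) \le 0, \text{ infinitely often}\right\} \subseteq \bigcup_{j=1}^{m}B_j^c.$$

Since this last set is $B^c$ and $P_{\theta_0}(B^c) = 0$, (7.50) follows. □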
The hard part of using this theorem is verifying the conditions. 

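Example 7.51. Suppose that $\{X_n\}_{n=1}^{\infty}$ given $\Theta = \theta$ are IID with $U(0,\theta)$
distribution. Then $f_{X_1|\Theta}(x|\theta) = 1/\theta$ for $0 < x < \theta$. We need $\mathrm{E}_{\theta_0}\inf_{\psi\in N_\theta}g(X_i) > 0$,
where $g(x)$ is the function

$$\log\frac{f_{X_1|\Theta}(x|\theta_0)}{f_{X_1|\Theta}(x|\psi)} = \begin{cases} \log\frac{\psi}{\theta_0} & \text{if } x < \min\{\theta_0, \psi\}, \\ \infty & \text{if } \psi \le x < \theta_0, \\ -\infty & \text{if } \theta_0 \le x < \psi, \\ \text{undefined} & \text{otherwise.}\end{cases}$$

Since the last two cases have 0 probability under $P_{\theta_0}$, we can choose
$N_\theta = ([\theta + \theta_0]/2, \infty)$ when $\theta > \theta_0$. In this case, $Z(N_\theta, x) = \log([\theta + \theta_0]/[2\theta_0]) > 0$,
a.s. $[P_{\theta_0}]$. If $\theta < \theta_0$, choose $N_\theta = (\theta/2, [\theta + \theta_0]/2)$. In this case $Z(N_\theta, x) = \infty$ if
$x \ge [\theta + \theta_0]/2$. Hence, $\mathrm{E}_{\theta_0}Z(N_\theta, X_i) > 0$ in either case.
We also need a compact set $C$ such that $\mathrm{E}_{\theta_0}Z(\Omega\setminus C, X_i) > 0$. Let
$C = [\theta_0/a, a\theta_0]$, for some $a > 1$. Then

$$\inf_{\psi\in C^c}\log\frac{f_{X_1|\Theta}(X_i|\theta_0)}{f_{X_1|\Theta}(X_i|\psi)} = \begin{cases} \log\frac{X_i}{\theta_0} & \text{if } X_i < \frac{\theta_0}{a}, \\ \log a & \text{if } X_i \ge \frac{\theta_0}{a}.\end{cases}$$

The conditional mean of this given $\Theta = \theta_0$ is

$$\frac{1}{\theta_0}\left[\int_0^{\theta_0/a}\log\frac{x}{\theta_0}\,dx + \int_{\theta_0/a}^{\theta_0}\log a\,dx\right].$$

The first integral goes to 0 and the second goes to $\infty$ as $a \to \infty$. This means that
there is some $a > 1$ such that the mean is positive. It follows from Theorem 7.49
that the MLE is consistent.

In this example, it would have been easier to find the distribution of $\hat{\Theta}_n$ and
prove directly that it was consistent, but we will need the above calculation in
Example 7.82 on page 432.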
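The conditional mean in Example 7.51 has a closed form after the substitution $y = x/\theta_0$, namely $(1 - 1/a)\log a - (1 + \log a)/a$, which does not depend on $\theta_0$. The sketch below (illustrative names, standard library only) evaluates this expression and cross-checks it against a midpoint-rule integral, showing a value of $a$ for which the mean is already positive.

```python
import math

def mean_z(a):
    """Closed form of E Z(Omega minus C, X_1) for C = [theta0/a, a*theta0], a > 1."""
    return (1.0 - 1.0 / a) * math.log(a) - (1.0 + math.log(a)) / a

def mean_z_numeric(a, slices=200000):
    """Midpoint rule in y = x/theta0: log y over (0, 1/a), plus log a over (1/a, 1)."""
    width = (1.0 / a) / slices
    first = sum(math.log((i + 0.5) * width) for i in range(slices)) * width
    return first + (1.0 - 1.0 / a) * math.log(a)

for a in (2.0, 3.0, 10.0):
    print(a, mean_z(a), mean_z_numeric(a))
```

The mean is negative at $a = 2$ and positive at $a = 3$, so $C = [\theta_0/3, 3\theta_0]$ is one compact set that satisfies the condition of Theorem 7.49.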

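Example 7.52. Suppose that $\{X_n\}_{n=1}^{\infty}$ given $\Theta = \theta$ are IID with $N(\theta, 1)$
distribution. It is easy to calculate

$$\log\frac{f_{X_1|\Theta}(x|\theta_0)}{f_{X_1|\Theta}(x|\theta)} = x(\theta_0 - \theta) + \frac{1}{2}(\theta^2 - \theta_0^2) = g(x,\theta). \qquad (7.53)$$

The minimum of this over any set occurs at $\theta$ equal to the value in the set closest
to $x$. So, if $N_\theta = (\theta - \epsilon, \theta + \epsilon)$, then $\mathrm{E}_{\theta_0}Z(N_\theta, X_i) = \mathcal{I}_{X_1}(\theta_0;\theta) + \mathrm{E}_{\theta_0}(R)$, where

$$R = \begin{cases} \epsilon(x - \theta) + \frac{\epsilon^2}{2} & \text{if } x \le \theta - \epsilon, \\ x(\theta - x) + \frac{x^2 - \theta^2}{2} & \text{if } \theta - \epsilon < x < \theta + \epsilon, \\ \epsilon(\theta - x) + \frac{\epsilon^2}{2} & \text{if } x \ge \theta + \epsilon.\end{cases}$$

Clearly, $\mathrm{E}_{\theta_0}(R)$ can be made arbitrarily small by choosing $\epsilon$ small. Similarly, if
$C = [\theta_0 - u, \theta_0 + u]$, for large $u$, then

$$Z(C^c, x) = \begin{cases} x(\theta_0 - x) + \frac{x^2 - \theta_0^2}{2} & \text{if } x < \theta_0 - u, \\ u(x - \theta_0) + \frac{u^2}{2} & \text{if } \theta_0 - u \le x \le \theta_0, \\ u(\theta_0 - x) + \frac{u^2}{2} & \text{if } \theta_0 < x \le \theta_0 + u, \\ x(\theta_0 - x) + \frac{x^2 - \theta_0^2}{2} & \text{if } x > \theta_0 + u.\end{cases}$$

We can make the integrals over the first and last portions of this arbitrarily small
by choosing $u$ large enough. The integral over the two middle portions equals
$(u^2/2)[2\Phi(u) - 1] - u[1 - \exp(-u^2/2)]\sqrt{2/\pi}$, where $\Phi$ is the standard normal
CDF, which is positive for large $u$.

Unfortunately, if the parameter is $\Theta = (M, \Sigma)$ and $X_i \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$,
it is not possible to find a compact set $C$ such that $\mathrm{E}_{\theta_0}Z(C^c, X_i) > 0$.
Berk (1966) replaces this condition with a weaker condition that first appeared
in Kiefer and Wolfowitz (1956). The proof that this weaker condition suffices for
convergence of the MLE involves martingales and is deferred to Lemma 7.83.
(Also, see Problem 45 on page 474 and Example 7.85 on page 434.)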

One of the conditions of Theorem 7.49 can be weakened if $f_{X_1|\Theta}(x|\cdot)$ is
continuous.$^{16}$

16 A slightly more general result can be proved by assuming that $f_{X_1|\Theta}$ is upper
semicontinuous (USC). A function $f : \Omega \to \mathbb{R}$ is upper semicontinuous if
$\limsup_{n\to\infty}f(\theta_n) \le f(\theta)$ whenever $\theta_n \to \theta$. USC functions possess two properties
that are needed in the proof of Theorem 7.49. The sum of two USC functions is
USC, and the maximum of a USC function is attained on a compact set.






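Lemma 7.54. Assume the same conditions as in Theorem 7.49, except
that we now only require that $\mathrm{E}_{\theta_0}Z(N_\theta, X_i) > -\infty$. Assume further that
$f_{X_1|\Theta}(x|\cdot)$ is continuous in $\theta$ for every $x$, a.s. $[P_{\theta_0}]$. Then, $\lim_{n\to\infty}\hat{\Theta}_n = \theta_0$, a.s.
$[P_{\theta_0}]$.

Proof. If $\Omega$ is compact, let $C = \Omega$. For each $\theta \neq \theta_0$ in $C$, let $N_\theta^{(k)}$ be
a closed ball centered at $\theta$ with radius at most $1/k$ such that, for each
$k$, $N_\theta^{(k+1)} \subseteq N_\theta^{(k)} \subseteq N_\theta$. This ensures that $\cap_{k=1}^{\infty}N_\theta^{(k)} = \{\theta\}$. So, for each
$x$, $Z(N_\theta^{(k)}, x)$ increases with $k$. For each $x$ such that $f_{X_1|\Theta}(x|\cdot)$ is continuous,
$\log[f_{X_1|\Theta}(x|\theta_0)/f_{X_1|\Theta}(x|\psi)]$ is continuous in $\psi$. So, for each $k$, there exists
$\theta_k \in N_\theta^{(k)}$ ($\{\theta_k\}_{k=1}^{\infty}$ might depend on $x$) such that$^{17}$
$Z(N_\theta^{(k)}, x) = \log[f_{X_1|\Theta}(x|\theta_0)/f_{X_1|\Theta}(x|\theta_k)]$. Since $\theta_k \to \theta$,

$$\lim_{k\to\infty}Z(N_\theta^{(k)}, x) = \log\frac{f_{X_1|\Theta}(x|\theta_0)}{f_{X_1|\Theta}(x|\theta)}. \qquad (7.55)$$

Since $N_\theta^{(k)} \subseteq N_\theta$, it follows that $Z(N_\theta^{(k)}, x) \ge Z(N_\theta, x)$. If $\mathrm{E}_{\theta_0}Z(N_\theta, X_i) = \infty$,
then we have $\mathrm{E}_{\theta_0}Z(N_\theta^{(k)}, X_i) = \infty$, for all $k$. If $\mathrm{E}_{\theta_0}Z(N_\theta, X_i)$ is finite,
then apply Fatou's lemma A.50 to $\{Z(N_\theta^{(k)}, x) - Z(N_\theta, x)\}_{k=1}^{\infty}$, and use
(7.55) to get

$$\liminf_{k\to\infty}\mathrm{E}_{\theta_0}Z(N_\theta^{(k)}, X_i) \ge \mathrm{E}_{\theta_0}\lim_{k\to\infty}Z(N_\theta^{(k)}, X_i) = \mathcal{I}_{X_1}(\theta_0;\theta) > 0, \qquad (7.56)$$

where $\mathcal{I}_{X_1}$ is the Kullback-Leibler information. Either way, we can now
choose $k^*(\theta)$ so that $\mathrm{E}_{\theta_0}Z(N_\theta^{(k^*(\theta))}, X_i) > 0$, and apply Theorem 7.49. □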

7.3.3 MLEs in Exponential Families 

In exponential families, MLEs exist (with probability tending to 1 as $n \to \infty$)
and are asymptotically normally distributed, and differentiable functions
of the MLE are asymptotically efficient estimators.

Theorem 7.57. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID given $\Theta = \theta$
with nondegenerate exponential family distribution whose density with
respect to a measure $\nu$ is

$$f_{X_1|\Theta}(x|\theta) = c(\theta)\exp(\theta^\top x).$$

Suppose that the natural parameter space $\Omega$ is an open subset of $\mathbb{R}^k$. Let
$\hat\Theta_n$ be the MLE of $\theta$ based on $X_1, \ldots, X_n$ if it exists. Then

• $\lim_{n\to\infty} P_\theta(\hat\Theta_n \text{ exists}) = 1$,



17 If $f_{X_1|\Theta}$ is USC, $\theta_k$ still exists. One must change the lim to lim inf wherever
it appears in front of $Z$, and one must change the equality to $\ge$ in (7.55) and
(7.56) to make the rest of the proof work.



7.3. Large Sample Estimation 419 



• under $P_\theta$, $\sqrt{n}(\hat\Theta_n - \theta) \xrightarrow{D} N_k(0, \mathcal{I}_{X_1}(\theta)^{-1})$, where $\mathcal{I}_{X_1}(\theta)$ is the Fisher
information matrix.

Proof. If the MLE exists, it will satisfy the equation that says that the
partial derivatives of the log-likelihood function with respect to each
coordinate of $\theta$ are 0, since the parameter space is open. Since

$$\frac{\partial}{\partial\theta_i}\log f_{X|\Theta}(x|\theta) = n\bar{x}_{n,i} + n\frac{\partial}{\partial\theta_i}\log c(\theta),$$

where $\bar{x}_{n,i}$ is the $i$th coordinate of $\bar{x}_n$, the resulting equation is
$\bar{x}_n = v(\theta)$, where the $i$th coordinate of $v(\theta)$ is
$-\partial\log c(\theta)/\partial\theta_i$. It follows from Proposition 2.70 that the covariance
$\sigma_{ij}$ between the $i$th and $j$th coordinates of $X_1$ given $\Theta = \theta$ is

$$\sigma_{ij} = -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log c(\theta).$$

Since a covariance matrix is positive semidefinite, it follows that $-\log c(\theta)$
is a convex function. Since each $X_i$ has nondegenerate exponential family
distribution, their coordinates are not linearly dependent, hence the matrix
$\Sigma = ((\sigma_{ij}))$ will be positive definite. It follows that $v(\cdot)$ has a differentiable
inverse $h(\cdot)$ in the sense that for each $\theta$ there is a neighborhood $N$ of $\theta$ such
that $h(v(\psi)) = \psi$ for $\psi \in N$, and the derivatives of $h$ are continuous. (See
the inverse function theorem C.2.) If $\bar{x}_n$ is in the image of $v$, then, for at
least one such function $h$, the MLE equals $h(\bar{x}_n)$. By the weak law of large
numbers B.95, $\bar{X}_n \xrightarrow{P} E_\theta X$ under $P_\theta$, and $E_\theta X = v(\theta)$ by Proposition 2.70.
It follows that $\bar{X}_n$ will be in the range of $v$ with probability tending to 1
as $n \to \infty$. It follows that the MLE exists with probability tending to 1.
The multivariate central limit theorem B.99 says that under $P_\theta$,
$\sqrt{n}(\bar{X}_n - v(\theta)) \xrightarrow{D} N_k(0, \Sigma)$. Using the delta method, we get that under $P_\theta$,
$\sqrt{n}(\hat\Theta_n - \theta) \xrightarrow{D} N_k(0, A\Sigma A^\top)$, where $A = ((a_{ij}))$ with $a_{ij} = \partial h_i/\partial t_j$ evaluated at
$t = v(\theta)$, which is the $(i,j)$ element of $\Sigma^{-1}$. So, $A = \Sigma^{-1}$. It is also easy
to see that $\Sigma$ is the Fisher information matrix $\mathcal{I}_{X_1}(\theta)$, so
$\sqrt{n}(\hat\Theta_n - \theta) \xrightarrow{D} N_k(0, \mathcal{I}_{X_1}(\theta)^{-1})$. □
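
As a rough numerical illustration of Theorem 7.57, the following sketch uses one particular one-parameter exponential family chosen here for convenience (it is not taken from the text): the density $-\theta\exp(\theta x)$ on $x > 0$ with natural parameter $\theta < 0$, for which $v(\theta) = -1/\theta$ and $\mathcal{I}_{X_1}(\theta) = 1/\theta^2$. The function names, sample sizes, and seed are illustrative assumptions.

```python
import math
import random


def natural_mle_exponential(xs):
    """MLE of the natural parameter theta < 0 for the density
    -theta * exp(theta * x) on x > 0; it solves xbar = v(theta) = -1/theta."""
    xbar = sum(xs) / len(xs)
    return -1.0 / xbar


def simulate_scaled_errors(theta, n, reps, seed=0):
    """Return reps draws of sqrt(n) * (MLE - theta) from simulated samples of size n."""
    rng = random.Random(seed)
    rate = -theta  # the family is the exponential distribution with rate -theta
    errors = []
    for _ in range(reps):
        xs = [rng.expovariate(rate) for _ in range(n)]
        errors.append(math.sqrt(n) * (natural_mle_exponential(xs) - theta))
    return errors


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


if __name__ == "__main__":
    theta = -2.0
    errs = simulate_scaled_errors(theta, n=400, reps=2000)
    print("empirical variance of sqrt(n)(MLE - theta):", sample_variance(errs))
    print("inverse Fisher information theta^2:", theta ** 2)
```

If the theorem applies as expected, the two printed numbers should be close for large $n$, up to simulation noise.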

The following corollary is trivial. 

Corollary 7.58. Under the conditions of Theorem 7.57, the MLE of $\Theta$ is
consistent.

Another corollary says that differentiable functions of MLEs are
asymptotically efficient estimators.

Corollary 7.59. Assume the conditions of Theorem 7.57. Suppose that
$g : \Omega \to \mathbb{R}$ has continuous partial derivatives. Then $g(\hat\Theta_n)$ is an asymptotically
efficient estimator of $g(\Theta)$.

Proof. Let $c_\theta$ be defined in (7.42). Using the delta method, we get

$$\sqrt{n}\,(g(\hat\Theta_n) - g(\theta)) \xrightarrow{D} N(0, c_\theta^\top \mathcal{I}_{X_1}(\theta)^{-1} c_\theta).$$

It follows that $g(\hat\Theta_n)$ is an asymptotically efficient estimator of $g(\Theta)$. □






7.3.4 Examples of Inconsistent MLEs 

There are some curious examples of MLEs that are inconsistent. Each of 
these examples fails one or more of the conditions of the theorems on con- 
sistency that are proved in this chapter. 

Example 7.60. This example was introduced by Neyman and Scott (1948) and
discussed by Barnard (1970). 18 Suppose that $(X_i, Y_i) \sim N_2(\mu_i \mathbf{1}, \sigma^2 I)$, given
$\Theta = (\sigma, \mu_1, \mu_2, \ldots)$, and the individual vectors are conditionally independent. These
observations are not conditionally IID, so that none of our theorems applies as
stated. Nonetheless, we can write a likelihood function

$$L(\theta) = (2\pi)^{-n}\sigma^{-2n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[(x_i - \mu_i)^2 + (y_i - \mu_i)^2\right]\right).$$

The logarithm of this is (aside from an additive constant)

$$-2n\log\sigma - \frac{1}{2\sigma^2}\left\{\sum_{i=1}^n (x_i - \mu_i)^2 + \sum_{i=1}^n (y_i - \mu_i)^2\right\}.$$

The MLEs are easily calculated as $\hat{M}_{i,n} = (X_i + Y_i)/2$ and
$\hat\Sigma_n^2 = \sum_{i=1}^n (X_i - Y_i)^2/[4n]$. Since the conditional distribution of $X_i - Y_i$
given $\Theta = \theta$ is $N(0, 2\sigma^2)$, it follows that $\hat\Sigma_n^2 \xrightarrow{P} \sigma^2/2$ under $P_\theta$. The MLE of
$\Sigma^2$ is inconsistent.

Barnard (1970) suggests an empirical Bayes approach, which is to choose a
distribution for the $M_i$ with a fixed finite number of parameters, call them $\Psi$.
Then treat $(\Psi, \Sigma)$ as the parameters and integrate the $M_i$ out of the problem.
See Section 8.4 for a discussion of empirical Bayes methods. Kiefer and Wolfowitz
(1956) let the distribution of the $M_i$ be more general, but they assume that the
distribution lies in a compact set of distributions so that methods like those of
Theorem 7.49 and Lemma 7.54 can be used.
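
The inconsistency in Example 7.60 is easy to see numerically. The sketch below simulates pairs with arbitrary nuisance means (drawing them uniformly on $[-10, 10]$ is purely an assumption of this illustration) and evaluates the MLE of $\sigma^2$; the function names and seed are hypothetical.

```python
import random


def neyman_scott_mle(pairs):
    """Return (mu_hats, sigma2_hat) for the likelihood of Example 7.60."""
    mu_hats = [(x + y) / 2.0 for x, y in pairs]
    sigma2_hat = sum((x - y) ** 2 for x, y in pairs) / (4.0 * len(pairs))
    return mu_hats, sigma2_hat


def simulate_pairs(n, sigma, seed=0):
    """Simulate n pairs (X_i, Y_i), each pair sharing its own mean mu_i."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)  # arbitrary nuisance mean (assumption)
        pairs.append((rng.gauss(mu_i, sigma), rng.gauss(mu_i, sigma)))
    return pairs


if __name__ == "__main__":
    sigma = 2.0
    for n in (100, 10_000, 100_000):
        _, s2 = neyman_scott_mle(simulate_pairs(n, sigma, seed=n))
        print(n, "MLE of sigma^2:", s2, "true sigma^2:", sigma ** 2)
```

As $n$ grows the estimate should settle near $\sigma^2/2$ rather than $\sigma^2$, in line with the limit derived in the example.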

Example 7.61. Let $\Omega = (1/2, 1]$ and suppose that for $x = 0, 1, 2, \ldots$,

$$f_{X_1|\Theta}(x|\theta) = \begin{cases} \theta(1-\theta)^x & \text{if } \theta \ne 1, \\ 2^{-(x+1)} & \text{if } \theta = 1. \end{cases}$$

This is a family of geometric distributions with $\theta = 1/2$ renamed to $\theta = 1$. The
density is neither continuous nor USC at $\theta = 1$. So, Lemma 7.54 will not apply.
We can write

$$\log\frac{f_{X_1|\Theta}(x|1)}{f_{X_1|\Theta}(x|\theta)} = -\log(2\theta) - x\log[2(1-\theta)], \tag{7.62}$$

and we note that for every compact subset $C$ of $\Omega$, the infimum of (7.62) over
$\theta \in C^C$ is 0 for $x > 0$ and is negative for $x = 0$. So the conditions of Theorem 7.49



18 Barnard claims that Neyman presented him with the example in a taxicab
in Paris in 1946. Barnard had just met Neyman for the first time and was 
arguing for the broad general validity of the method of maximum likelihood 
when Neyman asked him what he would do in this example. 






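fail as well. Let $\bar{X}_n$ be the average of the first $n$ observations. The MLE based
on the first $n$ observations is

$$\hat\Theta_n = \begin{cases} (1 + \bar{X}_n)^{-1} & \text{if } \bar{X}_n < 1, \\ 1 & \text{if } \bar{X}_n \ge 1. \end{cases}$$

Under $P_1$, $\sqrt{n}(\bar{X}_n - 1) \xrightarrow{D} N(0, 2)$, since the mean of each $X_i$ is 1 and the variance
is 2. It follows that $\lim_{n\to\infty} P'_1(\bar{X}_n \ge 1) = 1/2$ and $1/(1 + \bar{X}_n) \xrightarrow{P} 1/2$ under $P_1$.
(This is not surprising, since $\theta = 1$ should have been $\theta = 1/2$.) So,

$$\lim_{n\to\infty} P'_1\left(|\hat\Theta_n - 1| > \tfrac{1}{4}\right) = \tfrac{1}{2},$$

and $\hat\Theta_n$ is inconsistent because it does not converge appropriately for $\theta = 1$.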
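
A small simulation makes the behavior in Example 7.61 concrete. Under $\theta = 1$ the data are geometric with success probability 1/2, and the sketch below estimates how often the MLE lands exactly on the boundary value 1. The helper names, sample size, and seed are assumptions made for this illustration.

```python
import random


def geometric_half(rng):
    """Number of failures before the first success when each trial succeeds w.p. 1/2."""
    failures = 0
    while rng.random() >= 0.5:
        failures += 1
    return failures


def mle_example_761(xs):
    """MLE from Example 7.61: 1/(1 + xbar) if xbar < 1, and 1 otherwise."""
    xbar = sum(xs) / len(xs)
    return 1.0 / (1.0 + xbar) if xbar < 1.0 else 1.0


def fraction_at_one(n, reps, seed=0):
    """Fraction of simulated samples (drawn under theta = 1) whose MLE equals 1.

    The all-zero sample also gives 1/(1 + 0) = 1, but that event has
    probability 2**(-n) and is negligible for the n used here.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [geometric_half(rng) for _ in range(n)]
        if mle_example_761(xs) == 1.0:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("fraction of samples with MLE = 1:", fraction_at_one(n=200, reps=2000))
```

The fraction should hover around 1/2 no matter how large $n$ is, so the MLE keeps splitting its mass between values near 1/2 and the value 1.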

7.3.5 Asymptotic Normality of MLEs 

Outside of exponential families, the proof that MLEs are asymptotically 
normal is a bit more complicated than the proof of Theorem 7.57. The 
following theorem gives conditions under which the MLE is asymptotically 
normal in general parametric families. 

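Theorem 7.63. Let $\Omega$ be a subset of $\mathbb{R}^p$, and let $\{X_n\}_{n=1}^\infty$ be conditionally
IID given $\Theta = \theta$ each with density $f_{X_1|\Theta}(\cdot|\theta)$. Let $\hat\Theta_n$ be an MLE. Assume
that $\hat\Theta_n \xrightarrow{P} \theta$ under $P_\theta$ for all $\theta$. Assume that $f_{X_1|\Theta}(x|\theta)$ has continuous
second partial derivatives with respect to $\theta$ and that differentiation can be
passed under the integral sign. Assume that there exists $H_r(x, \theta)$ such that,
for each $\theta_0 \in \mathrm{int}(\Omega)$ and each $k, j$,

$$\sup_{\|\theta - \theta_0\| \le r}\left|\frac{\partial^2}{\partial\theta_k\,\partial\theta_j}\log f_{X_1|\Theta}(x|\theta_0) - \frac{\partial^2}{\partial\theta_k\,\partial\theta_j}\log f_{X_1|\Theta}(x|\theta)\right| \le H_r(x, \theta_0), \tag{7.64}$$

with $\lim_{r\to 0} E_{\theta_0} H_r(X, \theta_0) = 0$. Assume that the Fisher information matrix
$\mathcal{I}_{X_1}(\theta)$ is finite and nonsingular. Then, under $P_{\theta_0}$,

$$\sqrt{n}(\hat\Theta_n - \theta_0) \xrightarrow{D} N_p(0, \mathcal{I}_{X_1}^{-1}(\theta_0)).$$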

Before we prove Theorem 7.63, here is an example in which condition 
(7.64) is met. 

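Example 7.65. Suppose that $f_{X_1|\Theta}(x|\theta) = (\pi[1 + (x - \theta)^2])^{-1}$. Then

$$\frac{\partial^2}{\partial\theta^2}\log f_{X_1|\Theta}(x|\theta) = -2\,\frac{1 - (x - \theta)^2}{[1 + (x - \theta)^2]^2}.$$

This is differentiable and the derivative has finite mean. Hence $H_r$ exists as in
Theorem 7.63.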




The idea of the proof of Theorem 7.63 is the following. We work with
the vector $\ell'_\theta(X)$ of partial derivatives of the logarithm of the likelihood
function divided by $n$. We evaluate a Taylor expansion of $\ell'_\theta(X)$ around $\theta_0$
at the point $\hat\Theta_n$. Since $\ell'_{\hat\Theta_n}(X)$ should be 0, we get that $\ell'_{\theta_0}(X)$ is essentially
minus the product of the matrix $B_n$ of second partial derivatives of the logarithm
of the likelihood (divided by $n$) and $\hat\Theta_n - \theta_0$. Since $\ell'_{\theta_0}(X)$ is the average of IID
random vectors with mean 0 and covariance matrix $\mathcal{I}_{X_1}(\theta_0)$, $\sqrt{n}\,\ell'_{\theta_0}(X)$ is
asymptotically normal with covariance $\mathcal{I}_{X_1}(\theta_0)$. Similarly, $B_n$ is nearly the
average of IID random matrices, so $B_n \xrightarrow{P} -\mathcal{I}_{X_1}(\theta_0)$. Setting the two sides
of the Taylor expansion equal, we get that $\sqrt{n}\,\mathcal{I}_{X_1}(\theta_0)(\hat\Theta_n - \theta_0)$ is
asymptotically normal with covariance matrix $\mathcal{I}_{X_1}(\theta_0)$. Multiplying by $\mathcal{I}_{X_1}(\theta_0)^{-1}$
gives the desired result.
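Proof of Theorem 7.63. Let

$$\ell_\theta(x) = \frac{1}{n}\sum_{i=1}^n \log f_{X_1|\Theta}(x_i|\theta).$$

The $j$th coordinate of the gradient $\ell'_\theta(x)$ is $\left(\sum_{i=1}^n \partial\log f_{X_1|\Theta}(x_i|\theta)/\partial\theta_j\right)/n$.
Since $\theta_0 \in \mathrm{int}(\Omega)$, there is an open neighborhood of $\theta_0$ in the interior of
$\Omega$. Since $\hat\Theta_n \xrightarrow{P} \theta_0$ under $P_{\theta_0}$, it follows that $Z_n I_{\mathrm{int}(\Omega)^C}(\hat\Theta_n) = o_P(1/\sqrt{n})$
as $n \to \infty$ for every sequence $\{Z_n\}_{n=1}^\infty$ of random variables. 19 Note that
$\ell'_{\hat\Theta_n}(X) = 0$ for $\hat\Theta_n \in \mathrm{int}(\Omega)$. It follows that

$$\ell'_{\hat\Theta_n}(X) = o_P\left(\frac{1}{\sqrt{n}}\right).$$

Using a one-term Taylor expansion (see Theorem C.1) of each coordinate
of $\ell'_{\hat\Theta_n}(X)$ around $\theta_0$, we get

$$\ell'_{\theta_0}(X) + \left(\left(\left.\frac{\partial^2}{\partial\theta_k\,\partial\theta_j}\ell_\theta(X)\right|_{\theta = \theta^*_{n,j}}\right)\right)(\hat\Theta_n - \theta_0) = o_P\left(\frac{1}{\sqrt{n}}\right), \tag{7.66}$$

where $\theta^*_{n,j}$ is between $\theta_0$ and $\hat\Theta_n$ for each $j$. Since $\hat\Theta_n \xrightarrow{P} \theta_0$ under $P_{\theta_0}$,
$\theta^*_{n,j} \xrightarrow{P} \theta_0$ for each $j$. Set $B_n$ equal to the matrix in (7.66). Then

$$\ell'_{\theta_0}(X) + B_n(\hat\Theta_n - \theta_0) = o_P\left(\frac{1}{\sqrt{n}}\right). \tag{7.67}$$

By passing derivatives under the integral sign in the equation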

19 In fact, $Z_n I_{\mathrm{int}(\Omega)^C}(\hat\Theta_n) = o_P(r_n)$ for every $r_n$. The reason is that it equals 0
with probability tending to 1 and it doesn't matter what it equals when it isn't
0.






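$$\int f_{X_1|\Theta}(x|\theta)\,d\nu(x) = 1,$$

we see that $E_{\theta_0}\ell'_{\theta_0}(X) = 0$. Similarly, we get that the conditional covariance
matrix given $\Theta = \theta_0$ of $\sqrt{n}\,\ell'_{\theta_0}(X)$ is $\mathcal{I}_{X_1}(\theta_0)$. The multivariate central limit
theorem B.99 gives us that $\sqrt{n}\,\ell'_{\theta_0}(X) \xrightarrow{D} N_p(0, \mathcal{I}_{X_1}(\theta_0))$. So $\sqrt{n}\,\ell'_{\theta_0}(X) = O_P(1)$.
It follows from (7.67) that

$$\sqrt{n}\,B_n(\hat\Theta_n - \theta_0) = O_P(1). \tag{7.68}$$

Next, note that $B_{n,(j,k)} = \left(\sum_{i=1}^n \partial^2\log f_{X_1|\Theta}(X_i|\theta_0)/\partial\theta_j\partial\theta_k\right)/n + \Delta_n$, and
(7.64) ensures that $|\Delta_n| \le \sum_{i=1}^n H_r(X_i, \theta_0)/n$ when $\|\theta_0 - \theta^*_n\| \le r$. The
weak law of large numbers B.95 says that

$$\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta_0) \xrightarrow{P} E_{\theta_0} H_r(X_i, \theta_0).$$

Let $\epsilon > 0$ and let $r$ be small enough so that $E_{\theta_0} H_r(X_i, \theta_0) < \epsilon/2$. Then

$$P'_{\theta_0}(|\Delta_n| > \epsilon) \le P'_{\theta_0}\left(\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta_0) > \epsilon\right) + P'_{\theta_0}(\|\theta_0 - \theta^*_n\| > r)$$

$$\le P'_{\theta_0}\left(\left|\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta_0) - E_{\theta_0} H_r(X_i, \theta_0)\right| > \frac{\epsilon}{2}\right) + P'_{\theta_0}(\|\theta_0 - \theta^*_n\| > r).$$

The last two probabilities go to 0 as $n \to \infty$, hence it follows that $\Delta_n = o_P(1)$.
Hence $B_n \xrightarrow{P} -\mathcal{I}_{X_1}(\theta_0)$, and $B_n = O_P(1)$ but $B_n \ne o_P(1)$. It follows
from (7.68) that $\sqrt{n}(\hat\Theta_n - \theta_0) = O_P(1)$. Now, write $B_n = -\mathcal{I}_{X_1}(\theta_0) + C_n$
where $C_n = o_P(1)$. Then $C_n(\hat\Theta_n - \theta_0) = o_P(1/\sqrt{n})$, and we can rewrite
(7.67) as

$$\sqrt{n}\,\ell'_{\theta_0}(X) - \mathcal{I}_{X_1}(\theta_0)\sqrt{n}(\hat\Theta_n - \theta_0) = o_P(1).$$

By Lemma 7.19, we get that $\mathcal{I}_{X_1}(\theta_0)\sqrt{n}(\hat\Theta_n - \theta_0) \xrightarrow{D} N_p(0, \mathcal{I}_{X_1}(\theta_0))$. Since
multiplication by a matrix is a continuous function, the result is proven. □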

When applying Theorems 7.57 and 7.63 with observed data, it is common
to replace $\mathcal{I}_{X_1}(\theta_0)$ by a matrix that does not depend on the unknown
parameter. One possibility is $\mathcal{I}_{X_1}(\hat\Theta_n)$, which is often called the expected
Fisher information. In the proof of Theorem 7.63, we saw that $\mathcal{I}_{X_1}(\hat\Theta_n) \xrightarrow{P} \mathcal{I}_{X_1}(\theta_0)$
given $\Theta = \theta_0$. We also saw, however, that $\mathcal{I}_{X_1}(\theta_0)$ arose in the
theorem as an approximation to $-1/n$ times the matrix of second partial
derivatives of the log-likelihood function at a point near $\hat\Theta_n$ (and near $\theta_0$).
It has been suggested [see, for example, Efron and Hinkley (1978)] that one
use $1/n$ times

$$-\left(\left(\left.\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_{X|\Theta}(x|\theta)\right|_{\theta = \hat\Theta_n}\right)\right) \tag{7.69}$$

in place of $\mathcal{I}_{X_1}(\hat\Theta_n)$ when $X = x$ is observed and one wishes to use the MLE
to make inference about $\Theta$. The quantity in (7.69) is called the observed




Fisher information. We will see later (in Section 7.4.2) that the observed 
Fisher information is indeed the appropriate matrix to use when the goal 
is to approximate the posterior distribution of a parameter by a normal 
distribution. Efron and Hinkley (1978) say that the reason for preferring 
observed over expected information is that the inverse of observed infor- 
mation is closer to the conditional variance of the MLE given an ancillary. 

Example 7.70. 20 Assume that $(X_1, Z_1), \ldots, (X_n, Z_n)$ are conditionally IID with
$Z_i$ having $Ber(1/2)$ distribution and $X_i | Z_i = z$ having $N(\theta, 1/[z+1])$ distribution
given $\Theta = \theta$. That is, we flip a fair coin before observing each $X_i$ and if the coin
comes up tails, we get an $N(\theta, 1)$ observation. If the coin comes up heads, we get
an $N(\theta, 1/2)$ observation. The log-likelihood function is a constant plus

$$\frac{1}{2}\sum_{i=1}^n \log(Z_i + 1) - \frac{1}{2}\sum_{i=1}^n (Z_i + 1)(X_i - \theta)^2.$$

The MLE is the weighted average $\hat\Theta_n = \sum_{i=1}^n (Z_i + 1)X_i / \sum_{i=1}^n (Z_i + 1)$. The
Fisher information in the whole sample is $\mathcal{I}_X(\theta) = 3n/2$, which is also the expected
Fisher information. The approximation to the distribution of $\hat\Theta_n$ given $\Theta = \theta$ using
the expected Fisher information is $N(\theta, 2/[3n])$. On the other hand, the observed
Fisher information is $J = \sum_{i=1}^n (z_i + 1)$. A natural ancillary upon which to condition
is $Z = (Z_1, \ldots, Z_n)$. The conditional distribution of $\hat\Theta_n$ given $Z$ and $\Theta = \theta$ is
$N(\theta, 1/J)$, which is the same as the approximation based on the observed Fisher
information.
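
The quantities in Example 7.70 are simple enough to compute directly. The following sketch (function names are illustrative, not from the text) evaluates the weighted-average MLE together with the observed and expected Fisher information for a given set of coin flips.

```python
def weighted_mle(xs, zs):
    """MLE of theta: weighted average of the x_i with weights z_i + 1."""
    weights = [z + 1 for z in zs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def observed_information(zs):
    """Observed Fisher information J = sum of (z_i + 1)."""
    return sum(z + 1 for z in zs)


def expected_information(n):
    """Expected Fisher information 3n/2 for a sample of size n."""
    return 1.5 * n


if __name__ == "__main__":
    zs = [1, 1, 1, 0]          # three heads and one tail
    xs = [1.0, 2.0, 3.0, 4.0]
    j = observed_information(zs)
    print("MLE:", weighted_mle(xs, zs))
    print("conditional variance given Z, 1/J:", 1.0 / j)
    print("variance from expected information:", 1.0 / expected_information(len(zs)))
```

When the coin flips are unbalanced, as in this small data set, the two variance approximations differ, and only $1/J$ matches the conditional distribution given the ancillary $Z$.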

LeCam (1970) proves asymptotic normality of MLEs under ostensibly 
weaker conditions than Theorem 7.63. The conditions do not require the 
existence of continuous second derivatives. They do, however, require the 
existence of functions that behave very much like second derivatives. Also, 
a condition very much like (7.64) is required, where the second derivatives 
are replaced by these other functions that behave like second derivatives. 21 

7.3.6 Asymptotic Properties of M-Estimators* 

In Section 5.1.5, we introduced a class of estimators called M-estimators. 
These estimators can be thought of as being chosen to maximize the log 
of some alternative likelihood or some (nearly) arbitrary function rather 
than the likelihood function. For example, suppose that we choose some 
function $\rho(a, b)$ and maximize $\sum_{i=1}^n \rho(X_i, \theta)$ as a function of $\theta$. If
$\rho(a, b) = \log f_{X_1|\Theta}(a|b)$, then we get maximum likelihood as a special case.

Not every function $\rho$ will be appropriate, however. The following
conditions will be assumed throughout this section:

1. For each $\theta_0$ and for all $\theta \ne \theta_0$, $E_{\theta_0}[\rho(X_i, \theta_0) - \rho(X_i, \theta)] > 0$.



20 This is a modification of an example of Cox (1958). 
21 The paper is worth locating if only to read the author's footnote. 
*This section may be skipped without interrupting the flow of ideas. 






2. For each $\theta_0$ and each $\theta \ne \theta_0$, there exists an open set $N_\theta$ containing
$\theta$ such that $E_{\theta_0} \inf_{\theta' \in N_\theta}[\rho(X_i, \theta_0) - \rho(X_i, \theta')] > -\infty$.

3. For each $\theta_0$, there exists a compact set $C$ containing $\theta_0$ such that
$E_{\theta_0} \inf_{\theta \notin C}[\rho(X_i, \theta_0) - \rho(X_i, \theta)] > 0$.

Condition 1 says that $\rho$ allows one to distinguish possible values of $\Theta$ from
each other and that $\rho(X_i, \theta_0)$ tends to be larger when $\Theta = \theta_0$ than when $\Theta$
equals some other value. Condition 2 says that it cannot be the case that
even when $\Theta = \theta_0$, there are some other possible values $\theta'$ of $\Theta$ that lead to
$\rho(X_i, \theta')$ being much larger than $\rho(X_i, \theta_0)$. Condition 3 says that there is
a region around $\theta_0$ such that all values of $\rho(X_i, \theta)$, for $\theta$ not in that region,
tend to be less than $\rho(X_i, \theta_0)$ simultaneously. If $\rho(a, b) = \log f_{X_1|\Theta}(a|b)$,
then conditions 2 and 3 are two of the conditions of Lemma 7.54. In fact,
the method of proof for that lemma can be applied to prove the following
proposition.

Proposition 7.71. Suppose that $\rho(X, \cdot)$ is continuous, a.s. $[P_{\theta_0}]$. Also,
assume conditions 1-3. If $\hat\Theta_n$ is the value of $\theta$ that maximizes $\sum_{i=1}^n \rho(X_i, \theta)$,
then $\hat\Theta_n \to \theta_0$, a.s. $[P_{\theta_0}]$.

Next, suppose that $\rho$ is differentiable with respect to the second
coordinate. Then, set $\psi(X_i, \theta) = \partial\rho(X_i, \theta)/\partial\theta$, and assume that $E_\theta\psi(X_i, \theta) = 0$
for all $\theta$. (This is the same as condition (5.39) on page 312.) Also, suppose
that $\psi$ is continuous. Now, we can try to solve $\sum_{i=1}^n \psi(X_i, \theta) = 0$. If there
is more than one solution, we can choose the one closest to some reasonable
(with any luck, consistent) estimator of $\Theta$. The next theorem says that as
$n$ increases, the probability that there is a solution to this equation near $\theta$
goes to 1, given $\Theta = \theta$.

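Theorem 7.72. Assume that $\Omega \subseteq \mathbb{R}$. Let $\psi : \mathcal{X} \times \Omega \to \mathbb{R}$ be such that

• $E_\theta\psi(X_i, \theta) = 0$,

• $\psi(x, \theta)$ is continuous in $\theta$,

• for each $\theta_0$, there exists $\delta > 0$ such that $E_{\theta_0}\psi(X_i, \theta)$ is strictly
decreasing as a function of $\theta$ for $|\theta - \theta_0| < \delta$.

Then, for each $\epsilon > 0$,

$$\lim_{n\to\infty} P'_{\theta_0}\left(\exists \text{ a solution of } \sum_{i=1}^n \psi(X_i, \theta) = 0 \text{ in } (\theta_0 - \epsilon, \theta_0 + \epsilon)\right) = 1.$$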

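Proof. If $\theta_1 \in (\theta_0 - \delta, \theta_0) \cap (\theta_0 - \epsilon, \theta_0)$ and $\theta_2 \in (\theta_0, \theta_0 + \delta) \cap (\theta_0, \theta_0 + \epsilon)$,
then $E_{\theta_0}\psi(X_i, \theta_1) > 0$ and $E_{\theta_0}\psi(X_i, \theta_2) < 0$. By the weak law of large
numbers B.95, under $P_{\theta_0}$, $\sum_{i=1}^n \psi(X_i, \theta_1)/n \xrightarrow{P} E_{\theta_0}\psi(X_i, \theta_1) > 0$ and
$\sum_{i=1}^n \psi(X_i, \theta_2)/n \xrightarrow{P} E_{\theta_0}\psi(X_i, \theta_2) < 0$. We now note that, because $\psi$ is
continuous in $\theta$, the probability that there is a solution satisfies

$$P'_{\theta_0}(\exists \text{ a solution in } (\theta_0 - \epsilon, \theta_0 + \epsilon)) \ge P'_{\theta_0}(\exists \text{ a solution in } (\theta_1, \theta_2)) \ge P'_{\theta_0}\left(\sum_{i=1}^n \psi(X_i, \theta_1) > 0 \text{ and } \sum_{i=1}^n \psi(X_i, \theta_2) < 0\right),$$

and the last of these goes to 1 as $n \to \infty$. □

A similar result can be proven if $E_{\theta_0}\psi(X_i, \theta)$ is nondecreasing.

Corollary 7.73. If $\hat\Theta_n$ is the closest solution to a consistent estimator,
then $\hat\Theta_n$ is consistent.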

Example 7.74. Suppose that $f_{X_1|\Theta}(x|\theta) = \{\pi[1 + (x - \theta)^2]\}^{-1}$. A consistent
estimator is $Y_{1/2}$, the median. The likelihood equation is

$$\sum_{i=1}^n \frac{x_i - \theta}{1 + (x_i - \theta)^2} = 0,$$

which has several solutions in general. To check the conditions of Theorem 7.72,
we note that $\psi(x, \theta) = (x - \theta)/[1 + (x - \theta)^2]$. Clearly, $E_\theta\psi(X, \theta) = 0$ for all $\theta$.
The derivative of $E_{\theta_0}\psi(X, \theta)$ evaluated at $\theta = \theta_0$ is

$$\frac{1}{\pi}\int\frac{x^2 - 1}{(x^2 + 1)^3}\,dx = \frac{1}{\pi}\left[\int\frac{dx}{(x^2 + 1)^2} - 2\int\frac{dx}{(x^2 + 1)^3}\right] = -\frac{1}{2\pi}\int\frac{dx}{(x^2 + 1)^2} < 0.$$

It follows that $E_{\theta_0}\psi(X, \theta)$ is strictly decreasing in $\theta$ for $\theta$ near $\theta_0$. If we choose
the solution to the likelihood equation closest to the median, we have another
consistent estimator.
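
The recipe in Example 7.74 (find the roots of the likelihood equation, then keep the one closest to the median) can be sketched as follows. The grid scan with bisection refinement is just one convenient way to locate sign changes and is an assumption of this illustration, as are the function names and the grid size; a coarse grid can miss pairs of roots that lie very close together.

```python
import statistics


def score(xs, theta):
    """Left-hand side of the Cauchy likelihood equation at theta."""
    return sum((x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


def bisect_root(xs, lo, hi, tol=1e-10):
    """Refine a root of the score bracketed by a sign change on [lo, hi]."""
    f_lo = score(xs, lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = score(xs, mid)
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def roots_of_likelihood_equation(xs, grid_size=2000):
    """Scan a grid covering the data and refine every sign change of the score."""
    lo, hi = min(xs) - 1.0, max(xs) + 1.0  # score > 0 at lo and < 0 at hi
    step = (hi - lo) / grid_size
    roots = []
    prev_t, prev_f = lo, score(xs, lo)
    for k in range(1, grid_size + 1):
        t = lo + k * step
        f = score(xs, t)
        if prev_f == 0.0:
            roots.append(prev_t)
        elif prev_f * f < 0:
            roots.append(bisect_root(xs, prev_t, t))
        prev_t, prev_f = t, f
    return roots


def root_closest_to_median(xs):
    """Solution of the likelihood equation nearest the sample median."""
    med = statistics.median(xs)
    return min(roots_of_likelihood_equation(xs), key=lambda r: abs(r - med))


if __name__ == "__main__":
    data = [-5.0, -0.2, 0.1, 0.4, 20.0]
    print("roots found:", roots_of_likelihood_equation(data))
    print("root closest to the median:", root_closest_to_median(data))
```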

Freedman and Diaconis (1982) give examples in which M-estimators are
inconsistent. Basically, the distribution from which the data arise has a
density that is designed to be particularly incompatible with the function
$\rho$ (or $\psi$) that one uses for the M-estimation.

The next theorem says that we can get efficient estimators without
actually finding MLEs by starting with any $\sqrt{n}$-consistent estimator and then
using one step of Newton's method to try to solve the likelihood equation.
The theorem is stated in terms of general M-estimators.

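Theorem 7.75. Let $\Omega \subseteq \mathbb{R}^k$ be open. Let $\tilde\Theta_n - \theta_0 = O_P(1/\sqrt{n})$ under
$P_{\theta_0}$. Let $\psi : \mathcal{X} \times \Omega \to \mathbb{R}^k$ be such that $E_{\theta_0}\psi(X_i, \theta_0) = 0$, and $\partial^2\psi(x, \theta)/\partial\theta^2$
is continuous in $\theta$. Define two matrices $J_\ell = ((J_{\ell,j,t}))$ for $\ell = 1, 2$ where

$$J_{1,j,t} = E_{\theta_0}\frac{\partial}{\partial\theta_t}\psi_j(X_i, \theta_0),$$

$$J_{2,j,t} = \mathrm{Cov}_{\theta_0}(\psi_j(X_i, \theta_0), \psi_t(X_i, \theta_0)).$$

Assume that $J_1$ is nonsingular. Also, assume that there exists $H_r(x)$ such
that

$$\sup_{\|\theta - \theta_0\| \le r}\left|\frac{\partial}{\partial\theta_t}\psi_j(x, \theta) - \frac{\partial}{\partial\theta_t}\psi_j(x, \theta_0)\right| \le H_r(x),$$

where $\lim_{r\to 0} E_{\theta_0} H_r(X_i) = 0$. Let $M(\theta) = ((m_{j,t}(\theta)))$, where

$$m_{j,t}(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta_t}\psi_j(X_i, \theta).$$

Let $\Theta^*_n = \tilde\Theta_n - M^{-1}(\tilde\Theta_n)\frac{1}{n}\sum_{i=1}^n \psi(X_i, \tilde\Theta_n)$. Then, under $P_{\theta_0}$,

$$\sqrt{n}(\Theta^*_n - \theta_0) \xrightarrow{D} N_k(0, J_1^{-1} J_2 J_1^{-1\top}).$$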

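Proof. Let $p(\theta) = \sum_{i=1}^n \psi(X_i, \theta)/n$, so that $\Theta^*_n = \tilde\Theta_n - M^{-1}(\tilde\Theta_n)p(\tilde\Theta_n)$.
We can write

$$p(\tilde\Theta_n) = p(\theta_0) + M(\theta^*)(\tilde\Theta_n - \theta_0),$$

where $\theta^*$ is between $\theta_0$ and $\tilde\Theta_n$. It follows that under $P_{\theta_0}$, $\theta^* \xrightarrow{P} \theta_0$. By
hypothesis,

$$|m_{j,t}(\theta^*) - m_{j,t}(\theta_0)| = \left|\frac{1}{n}\sum_{i=1}^n\left[\frac{\partial}{\partial\theta_t}\psi_j(X_i, \theta^*) - \frac{\partial}{\partial\theta_t}\psi_j(X_i, \theta_0)\right]\right| \le \frac{1}{n}\sum_{i=1}^n |H_{\|\theta^* - \theta_0\|}(X_i)|.$$

Since $\|\theta^* - \theta_0\| = o_P(1)$, $|m_{j,t}(\theta^*) - m_{j,t}(\theta_0)| = o_P(1)$. By hypothesis
$\|\tilde\Theta_n - \theta_0\|^2 = O_P(1/n)$, so $p(\tilde\Theta_n) = p(\theta_0) + M(\theta_0)(\tilde\Theta_n - \theta_0) + O_P(1/n)$.
Also,

$$M(\tilde\Theta_n) = M(\theta_0) + O_P\left(\frac{1}{\sqrt{n}}\right) = M(\theta_0)\left[I + O_P\left(\frac{1}{\sqrt{n}}\right)\right].$$

It follows that

$$\Theta^*_n = \tilde\Theta_n - \left[I + O_P\left(\frac{1}{\sqrt{n}}\right)\right]M(\theta_0)^{-1}\left[p(\theta_0) + M(\theta_0)(\tilde\Theta_n - \theta_0) + O_P\left(\frac{1}{n}\right)\right]$$

$$= \tilde\Theta_n - M(\theta_0)^{-1}p(\theta_0) - (\tilde\Theta_n - \theta_0) + o_P\left(\frac{1}{\sqrt{n}}\right).$$

The last equality holds since $E_{\theta_0}\psi(X_i, \theta_0) = 0$, implying that $p(\theta_0) = o_P(1)$,
and because $\tilde\Theta_n - \theta_0 = o_P(1)$. So,

$$\sqrt{n}(\Theta^*_n - \theta_0) = -\sqrt{n}\,M(\theta_0)^{-1}p(\theta_0) + o_P(1). \tag{7.76}$$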






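The weak law of large numbers B.95 says that under $P_{\theta_0}$,

$$M(\theta_0) \xrightarrow{P} \left(\left(E_{\theta_0}\frac{\partial}{\partial\theta_t}\psi_j(X_i, \theta_0)\right)\right) = J_1.$$

Also, the multivariate central limit theorem B.99 says that under $P_{\theta_0}$,

$$\sqrt{n}\,p(\theta_0) \xrightarrow{D} N_k(0, J_2).$$

The result now follows easily from (7.76) and Theorem 7.22. □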

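Example 7.77. Consider the Cauchy distribution with a location parameter.
The likelihood equation is very difficult to solve, but there is a simple
$\sqrt{n}$-consistent estimator, namely, the sample median $\tilde\Theta_n$. Set

$$\psi(x, \theta) = \frac{2(x - \theta)}{1 + (x - \theta)^2}.$$

We now calculate

$$E_{\theta_0}\frac{\partial}{\partial\theta}\psi(X_i, \theta_0) = -2E_{\theta_0}\frac{1 - (X_i - \theta_0)^2}{[1 + (X_i - \theta_0)^2]^2} = -2E_0\frac{1 - X_i^2}{(1 + X_i^2)^2} = -\frac{2}{\pi}\int\frac{1 - x^2}{(1 + x^2)^3}\,dx = -\frac{1}{2}.$$

The other conditions of Theorem 7.75 can also be verified. The following estimator
is asymptotically efficient:

$$\Theta^*_n = \tilde\Theta_n + \frac{\sum_{i=1}^n \dfrac{X_i - \tilde\Theta_n}{1 + (X_i - \tilde\Theta_n)^2}}{\sum_{i=1}^n \dfrac{1 - (X_i - \tilde\Theta_n)^2}{[1 + (X_i - \tilde\Theta_n)^2]^2}}.$$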
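
The one-step estimator of Example 7.77 takes only a few lines of code. The sketch below starts from the sample median and applies a single Newton step to the Cauchy likelihood equation; the sampler, function names, and seed are assumptions introduced for this illustration.

```python
import math
import random
import statistics


def one_step_cauchy(xs):
    """One Newton step on the Cauchy likelihood equation, started at the median."""
    start = statistics.median(xs)
    num = sum((x - start) / (1.0 + (x - start) ** 2) for x in xs)
    den = sum((1.0 - (x - start) ** 2) / (1.0 + (x - start) ** 2) ** 2 for x in xs)
    return start + num / den


def cauchy_sample(theta, n, seed=0):
    """Draw n Cauchy variates with location theta by inverting the CDF."""
    rng = random.Random(seed)
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


if __name__ == "__main__":
    xs = cauchy_sample(theta=3.0, n=5000, seed=11)
    print("median:", statistics.median(xs))
    print("one-step estimator:", one_step_cauchy(xs))
```

In small samples the denominator can be close to zero, in which case the step is unstable; the asymptotic result says nothing about such cases.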

A theorem like Theorem 7.63 can be proven for consistent M-estimators.
The asymptotic distribution is the same as the asymptotic distribution
of $\Theta^*_n$ in Theorem 7.75. The proof proceeds either by showing that the
M-estimator differs from $\Theta^*_n$ by $o_P(1/\sqrt{n})$ or by rewriting the proof of
Theorem 7.63 using $\psi$ in place of $\ell'_\theta$. The details are left to the interested
reader. Huber (1967) also proves theorems of this sort.



7.4 Large Sample Properties of Posterior 
Distributions 

There are three kinds of large sample properties of posterior distributions 
which we explore in this section. One kind is classical properties, such 
as consistency and asymptotic normality conditional on a parameter. Ex- 
amples include Theorem 7.80 and the asymptotic normality of posterior 
distributions as presented in Section 7.4.2. Another kind is prior proper- 
ties, where the probability statements concern the prior joint distributions 






of all random quantities of interest. Some examples include Theorems 7.78 
and 7.120. A third kind is pointwise properties. These concern limits along 
certain data sequences, and an example is in Section 7.4.3. 

One thing that must be kept in mind about the prior properties is that, 
without further effort and/or conditions, these properties do not usually 
imply corresponding non-Bayesian properties. For example, in Section 7.4.1 
we will prove that, under certain conditions, the posterior distribution con- 
centrates around the actual value of the parameter as n increases, with 
probability 1 under the prior joint distribution of the data and the pa- 
rameter. But if the prior distribution of the parameter is concentrated on 
a small portion of the parameter space, one cannot expect the posterior 
distribution to concentrate near $\theta$ given $\Theta = \theta$ for values of $\theta$ not in that
small portion. Some examples are given in Section 7.4.1. 

7.4.1 Consistency of Posterior Distributions* 

Doob (1949) proved a theorem that says that if there exists a consistent
estimator of a parameter $\Theta$, then the posterior distribution of $\Theta$ is consistent
in the following sense. Let $\mu_{\Theta|X_1,\ldots,X_n}(\cdot|x_1, \ldots, x_n)$ denote the posterior
probability measure over $(\Omega, \tau)$ given $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$. Let
$A \in \tau$ and let $I_A$ be the indicator function of $A$. The theorem says that
$\mu_{\Theta|X_1,\ldots,X_n}(A|x_1, \ldots, x_n)$ converges almost surely, as $n \to \infty$, to $I_A(\Theta)$.
The proof given below is adapted from Schwartz (1965, Theorem 3.2) and
is similar to the proof of Theorem 2 of Schervish and Seidenfeld (1990).
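
To see what this kind of consistency looks like in a toy setting, the following sketch draws $\Theta$ from a prior on a finite grid of Bernoulli success probabilities, generates data given $\Theta$, and computes the posterior probability of a set $A$. The grid, the uniform prior, the set $A$, and all function names are assumptions made for this illustration; they are not part of the theorem.

```python
import math
import random


def posterior_on_grid(grid, prior, xs):
    """Posterior over a finite grid of Bernoulli success probabilities."""
    successes = sum(xs)
    failures = len(xs) - successes
    log_post = [
        math.log(p) + successes * math.log(t) + failures * math.log(1.0 - t)
        for t, p in zip(grid, prior)
    ]
    top = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_probability(grid, posterior, in_set):
    """Posterior probability of the set described by the predicate in_set."""
    return sum(q for t, q in zip(grid, posterior) if in_set(t))


def draw_theta_and_data(grid, prior, n, seed=0):
    """Draw theta from the prior, then n Bernoulli(theta) observations."""
    rng = random.Random(seed)
    theta = rng.choices(grid, weights=prior, k=1)[0]
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, xs


if __name__ == "__main__":
    grid = [0.2, 0.4, 0.6, 0.8]
    prior = [0.25] * 4
    theta, xs = draw_theta_and_data(grid, prior, n=2000, seed=2)
    for n in (10, 100, 2000):
        post = posterior_on_grid(grid, prior, xs[:n])
        print(n, "P(Theta < 0.5 | data) =", posterior_probability(grid, post, lambda t: t < 0.5))
    print("indicator of A at the drawn theta:", 1.0 if theta < 0.5 else 0.0)
```

The posterior probability of $A$ should approach 0 or 1 as $n$ grows, matching the indicator $I_A(\Theta)$ of the value actually drawn from the prior.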

Theorem 7.78. 22 Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $(\mathcal{X}^1, \mathcal{B}^1)$ be a
Borel space, and let $(\Omega, \tau)$ be a finite-dimensional parameter space with
Borel $\sigma$-field. Let $\Theta : S \to \Omega$ and $X_n : S \to \mathcal{X}^1$, for $n = 1, 2, \ldots$, be
measurable functions. Suppose that there exists a sequence of functions
$h_n : \mathcal{X}^n \to \Omega$ such that $h_n(X_1, \ldots, X_n)$ converges in probability to $\Theta$. Let
$\mu_{\Theta|X_1,\ldots,X_n}(\cdot|x_1, \ldots, x_n)$ denote the posterior probability measure on $(\Omega, \tau)$
given $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$. For each $A \in \tau$,

$$\lim_{n\to\infty} \mu_{\Theta|X_1,\ldots,X_n}(A|X_1, \ldots, X_n) = I_A(\Theta), \quad \text{a.s. } [\mu].$$

Proof. According to Theorem B.90, there is a subsequence $\{n_k\}_{k=1}^\infty$ such
that $Z_k = h_{n_k}(X_1, \ldots, X_{n_k})$ converges to $\Theta$, a.s. Let $A \in \tau$, and let
$Z = \lim_{k\to\infty} Z_k$ when the limit exists and $Z = \theta_0$ otherwise, where $\theta_0 \in \Omega$.
Then $I_A(\Theta) = I_A(Z)$ a.s. Since $Z$ is measurable with respect to the
$\sigma$-field generated by $\{X_n\}_{n=1}^\infty$, part I of Lévy's theorem B.118 says that



+ This section contains results that rely on the theory of martingales. It may 
be skipped without interrupting the flow of ideas. 

22 The proof relies on martingale theory. The proof in Doob (1949) does not 
require that a consistent estimator exists, but it has a slightly stronger assumption 
which implies that a consistent estimator exists. 






$E(I_A(Z)|X_1, \ldots, X_n)$ converges almost surely to $I_A(Z)$. Since $I_A(Z) = I_A(\Theta)$,
a.s., $E(I_A(Z)|X_1, \ldots, X_n) = \mu_{\Theta|X_1,\ldots,X_n}(A|X_1, \ldots, X_n)$, a.s., and
the result is proven. □

The intuitive meaning of this theorem is that when a consistent
estimator of $\Theta$ exists, the posterior distribution of $\Theta$ will tend to concentrate
near the true value of $\Theta$, with probability 1 under the joint distribution of
the data and the parameter. A consistent estimator of $\Theta$ fails to exist if
the parameter is not identifiable through the sequence of data values. The
parameter $M$ in Problem 16 on page 75 is an example of such a nonidentifiable
parameter. Note that there was no explicit mention of the prior
distribution of the parameter in Theorem 7.78. Since the theorem makes
claims only about probabilities calculated under the joint distribution of
the data and the parameter, it holds for all prior distributions. If the prior
"misses the true value," then the conclusion to the theorem is not very
interesting.

Example 7.79. Suppose that person A has a prior for $\Theta$ that is concentrated
on a set $C \subseteq \Omega$ and person B believes that $\Theta$ is concentrated on a set $D \subseteq \Omega$.
Suppose that $E$ is an open set such that $C \subseteq E$ and that the closures of $D$ and
$E$ are disjoint. Person B believes that $I_E(\Theta) = 0$ with probability 1, and person
A believes that $I_E(\Theta) = 1$ with probability 1. Oddly enough, they both believe
that the limit of the posterior probability of $E$ is $I_E(\Theta)$ with probability 1; they
just don't agree on which of the two possible values $I_E(\Theta)$ will equal.

Example 7.79 raises an interesting question. If two people (A and B) have
different models for data, what does person A believe about the asymptotic
behavior of the posterior distribution calculated by person B? Berk (1966)
proved a theorem like Theorem 7.80 for the finite-dimensional parameter
case. Strasser (1981) used a result of Perlman (1972) to prove a theorem for
more general parameters. Theorem 7.80 says that under conditions similar
to those that guarantee consistency of MLEs, if person B uses a parametric
family with parameter $\Theta$ and person A believes that the data are IID
with the distribution corresponding to $\Theta = \theta_0$ in the parametric family,
then person A believes that person B's posterior will asymptotically be
concentrated on the set of $\theta$ values such that $\mathcal{I}_{X_1}(\theta_0; \theta)$ is small, where
$\mathcal{I}_{X_1}(\theta_0; \theta)$ is the Kullback-Leibler information.

Theorem 7.80. Assume the conditions of Theorem 7.49 or of Lemma 7.54.
For $\epsilon > 0$, define $C_\epsilon = \{\theta : \mathcal{I}_{X_1}(\theta_0; \theta) < \epsilon\}$. Let $\mu_\Theta$ be a prior distribution
such that $\mu_\Theta(C_\epsilon) > 0$, for every $\epsilon > 0$. Then, for every $\epsilon > 0$ and open set
$N_0$ containing $C_\epsilon$, the posterior satisfies $\lim_{n\to\infty} \mu_{\Theta|X^n}(N_0|x^n) = 1$, a.s.
$[P_{\theta_0}]$, where $X^n = (X_1, \ldots, X_n) = (x_1, \ldots, x_n) = x^n$ are the data.

Proof. For each $x \in \mathcal{X}^\infty$, the infinite product space of copies of $\mathcal{X}^1$, define

$$D_n(\theta, x) = \frac{1}{n}\sum_{i=1}^n \log\frac{f_{X_1|\Theta}(x_i|\theta_0)}{f_{X_1|\Theta}(x_i|\theta)}.$$







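Write the posterior odds of $N_0$ as

$$\frac{\int_{N_0} d\mu_{\Theta|X^n}(\theta|x^n)}{\int_{N_0^C} d\mu_{\Theta|X^n}(\theta|x^n)} = \frac{\int_{N_0}\prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta)\,d\mu_\Theta(\theta)}{\int_{N_0^C}\prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta)\,d\mu_\Theta(\theta)} = \frac{\int_{N_0}\exp(-nD_n(\theta, x))\,d\mu_\Theta(\theta)}{\int_{N_0^C}\exp(-nD_n(\theta, x))\,d\mu_\Theta(\theta)}. \tag{7.81}$$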



The idea behind the remainder of the proof is the following. For each $x$ in a
set with probability 1, we find a lower bound on the numerator of the last
expression in (7.81) and an upper bound on the denominator such that the
ratio of these bounds goes to $\infty$.

First, look at the denominator of the last expression in (7.81). Just as
in the proof of Theorem 7.49, construct the sets $\Omega_1, \ldots, \Omega_m$ so that
$\Omega = N_0 \cup (\cup_{j=1}^m \Omega_j)$, and $E_{\theta_0} Z(\Omega_j, X_i) = c_j > 0$. It is easy to see that for
$\theta \in \Omega_j$,

$$D_n(\theta, x) \ge \frac{1}{n}\sum_{i=1}^n Z(\Omega_j, x_i).$$

So the denominator of the last expression in (7.81) is at most

$$\sum_{j=1}^m \int_{\Omega_j}\exp(-nD_n(\theta, x))\,d\mu_\Theta(\theta) \le \sum_{j=1}^m \sup_{\theta\in\Omega_j}\exp(-nD_n(\theta, x))\,\mu_\Theta(\Omega_j) \le \sum_{j=1}^m \exp\left(-\sum_{i=1}^n Z(\Omega_j, x_i)\right)\mu_\Theta(\Omega_j).$$

For each $j$, the strong law of large numbers 1.63 says there exists $B_j \subseteq \mathcal{X}^\infty$
such that $P_{\theta_0}(B_j) = 1$ and such that, for every $x \in B_j$, there exists an
integer $K_j(x)$ such that $n \ge K_j(x)$ implies $\sum_{i=1}^n Z(\Omega_j, x_i)/n > c_j/2 > 0$.
Let $c = \min\{c_1, \ldots, c_m\}$, $B = \cap_{j=1}^m B_j$, and $N(x) = \max\{K_1(x), \ldots, K_m(x)\}$.
For each $x \in B$ and $n \ge N(x)$, the denominator of the last expression in
(7.81) is at most $\exp(-nc/2)$.

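For the numerator, let $0 < \delta < \min\{\epsilon, c/2\}/4$. For each $x \in \mathcal{X}^\infty$ or $\theta \in \Omega$,
let

$$W_n(x) = \{\theta : D_\ell(\theta, x) \le \mathcal{I}_{X_1}(\theta_0; \theta) + \delta, \text{ for all } \ell \ge n\},$$
$$V_n(\theta) = \{x : D_\ell(\theta, x) \le \mathcal{I}_{X_1}(\theta_0; \theta) + \delta, \text{ for all } \ell \ge n\}.$$

For each $\theta$, the strong law of large numbers 1.63 says that $D_n(\theta, x) \to \mathcal{I}_{X_1}(\theta_0; \theta)$,
a.s. $[P_{\theta_0}]$, so $P_{\theta_0}(\cup_{n=1}^\infty V_n(\theta)) = 1$. Now use this fact together
with the fact that the sets $V_n(\theta)$ are increasing and the fact that $x \in V_n(\theta)$
if and only if $\theta \in W_n(x)$ to write

$$\mu_\Theta(C_\delta) = \int_{C_\delta}\lim_{n\to\infty} P_{\theta_0}(V_n(\theta))\,d\mu_\Theta(\theta) = \lim_{n\to\infty}\int_{C_\delta}\int_{\mathcal{X}^\infty} I_{V_n(\theta)}(x)\,dP_{\theta_0}(x)\,d\mu_\Theta(\theta)$$

$$= \lim_{n\to\infty}\int_{\mathcal{X}^\infty}\int_{C_\delta} I_{W_n(x)}(\theta)\,d\mu_\Theta(\theta)\,dP_{\theta_0}(x) = \lim_{n\to\infty}\int_{\mathcal{X}^\infty}\mu_\Theta(C_\delta \cap W_n(x))\,dP_{\theta_0}(x)$$

$$= \int_{\mathcal{X}^\infty}\lim_{n\to\infty}\mu_\Theta(C_\delta \cap W_n(x))\,dP_{\theta_0}(x).$$

Since $\mu_\Theta(C_\delta \cap W_n(x)) \le \mu_\Theta(C_\delta)$ for all $x$ and $n$, we have
$\lim_{n\to\infty}\mu_\Theta(C_\delta \cap W_n(x)) = \mu_\Theta(C_\delta)$, a.s. $[P_{\theta_0}]$, because strict inequality with positive
probability would contradict the above string of equalities. So, there is a set
$B' \subseteq \mathcal{X}^\infty$ with $P_{\theta_0}(B') = 1$ and for every $x \in B'$, there exists $N'(x)$ such
that $n \ge N'(x)$ implies $\mu_\Theta(C_\delta \cap W_n(x)) > \mu_\Theta(C_\delta)/2$. So, if $x \in B'$ and
$n \ge N'(x)$, the numerator of the last expression in (7.81) is at least

$$\int_{C_\delta \cap W_n(x)}\exp(-nD_n(\theta, x))\,d\mu_\Theta(\theta) \ge \frac{1}{2}\mu_\Theta(C_\delta)\exp(-2n\delta),$$
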

since $\mathcal{I}_{X_1}(\theta_0; \theta) < \delta$ for $\theta \in C_\delta$. It follows that if $x \in B \cap B'$ and
$n \ge \max\{N(x), N'(x)\}$, then the ratio in (7.81) is at least $\mu_\Theta(C_\delta)\exp(nc/4)/2$,
which goes to $\infty$ with $n$. □

Example 7.82 (Continuation of Example 7.51; see page 416). Suppose that
$\{X_n\}_{n=1}^\infty$ given $\Theta = \theta$ are IID with $U(0, \theta)$ distribution. We saw earlier that
the conditions of Theorem 7.49 are satisfied. The Kullback-Leibler information
is

$$\mathcal{I}_{X_1}(\theta_0; \theta) = \begin{cases} \log(\theta/\theta_0) & \text{if } \theta \ge \theta_0, \\ \infty & \text{if } \theta < \theta_0. \end{cases}$$

The set $C_\epsilon$ is the interval $[\theta_0, \exp(\epsilon)\theta_0)$. An open set $N_0$ containing this interval
will need to contain an open interval $(\theta_0 - \delta, \theta_0)$ for some $\delta > 0$. So long as the prior
distribution assigns positive mass to every open interval, then, for every $\theta_0$, every
open interval around $\theta_0$ will have posterior probability going to 1, a.s. $[P_{\theta_0}]$. This
is a much stronger claim than one could infer from Theorem 7.78.

As we noted earlier, the conditions of Theorem 7.49 fail in some
multiparameter problems. Berk (1966) proves the following lemma, giving a
slightly weaker condition that holds in more cases. 24

23 See Problem 48 on page 474 to see why the posterior probability of $C_\epsilon$ does
not go to 1 almost surely.

Lemma 7.83. In the notation of Theorems 7.49 and 7.80, instead of
assuming that there is a compact set $C$ such that $E_{\theta_0} Z(\Omega \setminus C, X_i) > 0$, assume
that there exist an integer $p$ and a compact set $C$ such that

$$E_{\theta_0}\inf_{\theta \in C^C}\frac{1}{p}\sum_{i=1}^p \log\frac{f_{X_1|\Theta}(X_i|\theta_0)}{f_{X_1|\Theta}(X_i|\theta)} > 0. \tag{7.84}$$

Then $D_n(\theta, x)$ is bounded below, uniformly in $\theta$, by a random variable that
converges almost surely to a positive value.

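Proof. Let $N_{p,n}$ be the collection of all subsets of size $p$ of distinct elements
of the set $\{1, \ldots, n\}$. Denote such subsets $\alpha = \{\alpha_1, \ldots, \alpha_p\}$. For $y \in \mathcal{X}^1$, let
$g(\theta, y)$ stand for $\log f_{X_1|\Theta}(y|\theta_0)/f_{X_1|\Theta}(y|\theta)$. It is clear that for every $\alpha \in N_{p,n}$
and each $\theta \in C^C$,

$$\frac{1}{p}\sum_{i=1}^p g(\theta, x_{\alpha_i}) \ge \inf_{\psi \in C^C}\frac{1}{p}\sum_{i=1}^p g(\psi, x_{\alpha_i}).$$

If we add both sides of this inequality for all $\alpha \in N_{p,n}$ and divide by $\binom{n}{p}$, we
get

$$D_n(\theta, x) \ge \binom{n}{p}^{-1}\sum_{\alpha \in N_{p,n}}\inf_{\psi \in C^C}\frac{1}{p}\sum_{i=1}^p g(\psi, x_{\alpha_i}).$$

Call the right-hand side of this expression $G_{n,p}(x)$. (Note that $G_{p,p}(X)$ is
the random variable in (7.84) and that $G_{n,p}(x)$ does not depend on $\theta$, so
that it is a uniform lower bound.) Due to the symmetry with respect to
permutations of coordinates, it is clear that $X_1, \ldots, X_n$ are conditionally
exchangeable given $\mathcal{F}_n$, the $\sigma$-field generated by $\{G_{n+i,p}\}_{i=0}^\infty$. It follows
that for every $\alpha \in N_{p,n}$,

$$E_{\theta_0}(G_{p,p}(X)|\mathcal{F}_n) = E_{\theta_0}\left(\left.\inf_{\psi \in C^C}\frac{1}{p}\sum_{i=1}^p g(\psi, X_{\alpha_i})\,\right|\,\mathcal{F}_n\right) = E_{\theta_0}(G_{n,p}(X)|\mathcal{F}_n) = G_{n,p}(X).$$

Now, apply part II of Lévy's theorem B.124 to conclude that $G_{n,p}(X)$
converges almost surely to $E_{\theta_0}(G_{p,p}(X)|\mathcal{F}_\infty)$, where $\mathcal{F}_\infty = \cap_{n=p}^\infty \mathcal{F}_n$. Since
$\mathcal{F}_\infty$ is a sub-$\sigma$-field of the tail $\sigma$-field of the sequence $\{X_n\}_{n=1}^\infty$, the
Kolmogorov zero-one law B.68 says that $E_{\theta_0}(G_{p,p}(X)|\mathcal{F}_\infty)$ is constant a.s.
$[P_{\theta_0}]$. The constant must be $E_{\theta_0}(G_{p,p}(X)) > 0$. So, $\lim_{n\to\infty} G_{n,p}(X) =
E_{\theta_0}(G_{p,p}(X)) > 0$, a.s. $[P_{\theta_0}]$. □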



24 The proof of Lemma 7.83 involves martingale theory. Lemma 7.83 also pro- 
vides a weaker condition under which the MLE converges a.s. [Pe 0 ]. 






Example 7.85 (Continuation of Example 7.52; see page 417). Suppose that $\{X_n\}_{n=1}^\infty$ given $\Theta = (\mu,\sigma)$ are IID with $N(\mu,\sigma^2)$ distribution. If $\theta_0 = (\mu_0,\sigma_0)$, it is easy to calculate

$$\log\frac{f_{X_1,X_2|\Theta}(x_1,x_2|\theta_0)}{f_{X_1,X_2|\Theta}(x_1,x_2|\theta)} = 2\log\frac{\sigma}{\sigma_0} + \frac{s^2 + (\bar{x}-\mu)^2}{\sigma^2} - \frac{s^2 + (\bar{x}-\mu_0)^2}{\sigma_0^2}, \tag{7.86}$$

where $\bar{x} = (x_1 + x_2)/2$ and $s^2 = (x_1 - x_2)^2/4$. (Without loss of generality, assume $\mu_0 = 0$ and $\sigma_0 = 1$ for the rest of this example.) Let $C$ be the rectangle where $\bar{x} \in [-u,u]$ and $s^2 \in [1/v,v]$. The integral of (7.86) for $(\bar{x},s^2) \notin C$ can be made negligible by choosing $v$ and $u$ large. For $(\bar{x},s^2) \in C$, the minimum of (7.86) will occur at one of the points (i) $(\bar{x}, v)$, (ii) $(\bar{x}, 1/v)$, (iii) $(u, s^2 + (\bar{x} - u)^2)$, or (iv) $(-u, s^2 + (\bar{x} + u)^2)$. By choosing $v$ large enough, one can check that case (ii) can be ignored and that (7.86) is very large in case (i). In case (iii), (7.86) equals $1 + \log(s^2 + (\bar{x} - u)^2) - s^2 - \bar{x}^2$. The integral of the last two terms (over the entire sample space) is $-1.5$. By choosing $u$ sufficiently large, the integral of the first two terms over the region where they add to less than 1.5 can be made negligibly small. This is similar for case (iv). So we can ensure that the minimum of (7.86) over $C^C$ has positive integral.

A famous example, in which the conditions of Theorem 7.80 fail, was given by Diaconis and Freedman (1986a, 1986b). It concerns an infinite-dimensional parameter space and a prior distribution constructed from the Dirichlet process described in Section 1.6.1. It is shown that, conditional on the data arising from a cleverly chosen continuous distribution, the posterior mean of a particular function $Y$ of the parameter is not consistent. Initially, what is surprising about this example is the following. Since continuous distribution functions can be approximated arbitrarily closely by discrete distribution functions, one might think that continuous distributions are "close" in some sense to those distributions on which the Dirichlet process concentrates. The problem is that the sense of closeness is not Kullback-Leibler information. Rather, it is based on convergence in distribution. 25 Although Theorem 7.78 can be used to show that the posterior mean of $Y$ converges in probability to $Y$ given distributions in a set $C$ with prior probability 1, the convergence does not extend to parameter values that are "close" to $C$ in the sense of convergence in distribution. (See Problem 47 on page 474.) Theorem 7.80 suggests that the type of closeness that implies consistency in Bayesian problems is much stronger. 26

25 There is a simple way to understand why inconsistency arises in the Diaconis and Freedman (1986a, 1986b) example. As Barron (1986) pointed out, when the data come from a continuous distribution, the posterior for $Y$ is the same (with probability 1) as what one would get if one assumed that the data were conditionally IID given $Y$ with the distribution given by the normalized base measure of the Dirichlet process. (See Lemma 1.104.) Since the normalized base measure that Diaconis and Freedman (1986a, 1986b) use looks absolutely nothing like the distribution that actually generates the data, it is not surprising that the posterior mean of $Y$ is not consistent. In fact the distribution that generates the data is chosen to be particularly incompatible with the base measure of the Dirichlet process in much the same way that the examples of inconsistent M-estimators were constructed by Freedman and Diaconis (1982).

7.4.2 Asymptotic Normality of Posterior Distributions 

Walker (1969) first proved that, under some conditions, the posterior dis- 
tribution of a one-dimensional parameter would look more and more like a 
normal distribution as more conditionally IID data were collected. Dawid 
(1970) proved a similar result under weaker conditions. Heyde and John- 
stone (1979) later extracted the essence of Walker's proof to show that it 
could extend to sequences of data that were not necessarily conditionally 
IID. 27 Johnstone (1978) contains a multiparameter version of the theorem 
of Heyde and Johnstone (1979). Still others— Brenner, Fraser, and McDun- 
nough (1982), Fraser and McDunnough (1984), and Chen (1985)— prove 
that, under certain conditions, the likelihood function (or the posterior 
density) converges (in probability or almost surely) to a normal density. 
The type of asymptotic normality proven by Walker (1969) and by Heyde 
and Johnstone (1979) follows from the convergence of the posterior density. 

In this section, we present a hybrid of the various theorems mentioned 
above. First, we prove that the posterior density of a suitable transforma- 
tion of the parameter vector converges to a normal density in probability. 
We will then use this to conclude that posterior probabilities converge in 
probability to multivariate normal probabilities. The general situation involves a sequence of random quantities $X_n : S \to \mathcal{X}_n$, for $n = 1, 2, \ldots$ and a parameter $\Theta : S \to \mathbb{R}^k$ such that the conditional distribution of $X_n$ given $\Theta = \theta$ has a density $f_{X_n|\Theta}(\cdot|\theta)$ with respect to a $\sigma$-finite measure $\nu_n$ on $\mathcal{X}_n$. We use the notation

$$\ell_n(\theta) = \log f_{X_n|\Theta}(X_n|\theta), \qquad \ell_n''(\theta) = \left(\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell_n(\theta)\right)\right). \tag{7.87}$$

Let $\hat\Theta_n$ stand for the MLE of $\Theta$ if it exists, and let 28

$$\Sigma_n = \begin{cases} -\ell_n''(\hat\Theta_n)^{-1} & \text{if the inverse and } \hat\Theta_n \text{ exist,} \\ I_k & \text{if not.} \end{cases} \tag{7.88}$$

The following regularity conditions are used in the general theorems.

26 Barron (1988) gives necessary and sufficient conditions for the posterior distribution to concentrate on sets "close" to the distribution that generates the data. His results even apply in nonparametric settings.

27 What Heyde and Johnstone (1979) did (whether intentionally or not) was to take the conclusions Walker (1969) derived from the assumption that the data were conditionally IID, and use them as assumptions. To use Heyde and Johnstone's result for the conditionally IID case, one need only repeat the portion of Walker's proof in which the assumptions of Heyde and Johnstone's theorem are proven directly. Alternatively, one could prove the assumptions independently.

28 Notice that $\Sigma_n^{-1}$ is the observed Fisher information matrix.






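General Regularity Conditions:

1. The parameter space is $\Omega \subseteq \mathbb{R}^k$ for some finite $k$.

2. $\theta_0$ is a point interior to $\Omega$.

3. The prior distribution of $\Theta$ has a density with respect to Lebesgue measure that is positive and continuous at $\theta_0$.

4. There exists a neighborhood $N_0 \subseteq \Omega$ of $\theta_0$ on which $\ell_n(\theta)$ is twice continuously differentiable with respect to all coordinates of $\theta$, a.s. $[P'_{\theta_0}]$.

5. The largest eigenvalue of $\Sigma_n$ goes to 0 in probability.

6. For $\delta > 0$, define $N_0(\delta)$ to be the open ball of radius $\delta$ around $\theta_0$. Let $\lambda_n$ be the smallest eigenvalue of $\Sigma_n$. If $N_0(\delta) \subseteq \Omega$, then there exists $K(\delta) > 0$ such that

$$\lim_{n\to\infty} P'_{\theta_0}\left(\sup_{\theta\in\Omega\setminus N_0(\delta)} \lambda_n[\ell_n(\theta) - \ell_n(\theta_0)] < -K(\delta)\right) = 1.$$

7. For each $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that

$$\lim_{n\to\infty} P'_{\theta_0}\left(\sup_{\theta\in N_0(\delta(\epsilon)),\,\|\gamma\|=1} \left|1 + \gamma^\top \Sigma_n^{1/2}\ell_n''(\theta)\Sigma_n^{1/2}\gamma\right| < \epsilon\right) = 1.$$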



A few words of explanation of these conditions are in order. The first speaks for itself. The second avoids having likelihood functions that are largest near the boundary of $\Omega$, and hence cannot look like normal densities. The third ensures that the prior density doesn't destroy the asymptotic normality of the likelihood function. The fourth is one of two smoothness conditions which also rules out distributions for which the support of the distribution depends on $\theta$. Condition 5 ensures that the amount of information in the data about all aspects of $\Theta$ increases without bound. Condition 6 ensures that the MLE is consistent and that the likelihood function can be ignored for values not near $\theta_0$. Condition 7 is a smoothness condition on the amount of information in the data about $\Theta$.

To be specific about what we mean by saying that the posterior distribution will look more like a normal distribution as more data are collected, consider the posterior probability that $\Sigma_n^{-1/2}(\Theta - \hat\Theta_n) \in B$ as a statistic $T_n$ (function of the data) prior to observing the data. Then $T_n$ converges in probability (under $P'_{\theta_0}$) to the multivariate normal probability of $B$. We can also make corresponding claims about the posterior density of $\Theta$.






7.4.2.1 Posterior Densities 

First, we prove that the posterior densities of a sequence of transformations of the parameter converge in probability uniformly on compact sets to a multivariate normal density. Since we expect that the posterior density of $\Theta$ will become more and more concentrated around $\theta_0$ given $\Theta = \theta_0$, the posterior density of $\Theta$ itself should not have an interesting asymptotic behavior. Rather, if we rescale $\Theta$ so that its variance is approximately constant (as a function of sample size), then perhaps the transformed random variable will have an interesting posterior density asymptotically. 29

Theorem 7.89. Assume the general regularity conditions, and let $\hat\Theta_n$ be an MLE of $\Theta$. Define $\ell_n$ as in (7.87), and let $\Sigma_n$ be defined by (7.88). Let $\Psi_n = \Sigma_n^{-1/2}(\Theta - \hat\Theta_n)$. Then the posterior density of $\Psi_n$ given $X_n$ converges in probability uniformly on compact sets to the $N_k(0, I_k)$ density $\phi$ given $\Theta = \theta_0$. That is, for each compact subset $B$ of $\mathbb{R}^k$, and each $\epsilon > 0$,

$$\lim_{n\to\infty} P'_{\theta_0}\left(\sup_{\psi\in B}\left|f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi)\right| > \epsilon\right) = 0.$$

Proof. First, note that general regularity condition 6 guarantees that $\hat\Theta_n$ is consistent, since, for each $\delta$ the probability goes to 1 that $\hat\Theta_n$ is inside of $N_0(\delta)$. Use Taylor's theorem C.1 to write

$$\begin{aligned} f_{X_n|\Theta}(X_n|\theta) &= f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp\{\ell_n(\theta) - \ell_n(\hat\Theta_n)\} \\ &= f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp\left\{-\tfrac{1}{2}(\theta - \hat\Theta_n)^\top \Sigma_n^{-1/2}\left(I_k - R_n(\theta, X_n)\right)\Sigma_n^{-1/2}(\theta - \hat\Theta_n) + \Delta_n\right\}, \end{aligned} \tag{7.90}$$

where

$$\Delta_n = (\theta - \hat\Theta_n)^\top \ell_n'(\hat\Theta_n)\, I_{\mathrm{int}(\Omega)^C}(\hat\Theta_n), \qquad R_n(\theta, X_n) = I_k + \Sigma_n^{1/2}\ell_n''(\theta_n^*)\Sigma_n^{1/2},$$

with $\theta_n^*$ between $\theta$ and $\hat\Theta_n$. Since $\theta_0 \in \mathrm{int}(\Omega)$ and $\hat\Theta_n$ is consistent, it follows that $\lim_{n\to\infty} P'_{\theta_0}(\Delta_n = 0, \text{ for all } \theta) = 1$. Now we can write the posterior density of $\Theta$ as

$$f_{\Theta|X_n}(\theta|X_n) = f_\Theta(\theta)\frac{f_{X_n|\Theta}(X_n|\theta)}{f_{X_n}(X_n)},$$

where

$$f_{X_n}(\cdot) = \int_\Omega f_\Theta(\theta) f_{X_n|\Theta}(\cdot|\theta)\,d\theta.$$

The posterior density of $\Psi_n$, $f_{\Psi_n|X_n}(\psi|X_n)$, can be written as

$$\frac{|\Sigma_n|^{1/2} f_{X_n|\Theta}(X_n|\hat\Theta_n)\, f_\Theta(\Sigma_n^{1/2}\psi + \hat\Theta_n)}{f_{X_n}(X_n)} \cdot \frac{f_{X_n|\Theta}(X_n|\Sigma_n^{1/2}\psi + \hat\Theta_n)}{f_{X_n|\Theta}(X_n|\hat\Theta_n)}. \tag{7.91}$$

29 It is interesting to note that LeCam (1970) proves asymptotic normality of MLEs by first showing that the logarithm of the likelihood function (as a function of $t = \sqrt{n}(\theta - \theta_0)$) is asymptotically quadratic with the same distribution as $-t^\top \mathcal{I}_{X_1}(\theta_0)t/2 + t^\top Y$, where $Y \sim N_k(0, \mathcal{I}_{X_1}(\theta_0))$. The maximum of this function is $t = \mathcal{I}_{X_1}(\theta_0)^{-1}Y$, which has $N_k(0, \mathcal{I}_{X_1}(\theta_0)^{-1})$ distribution.

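Our first step is to see how the first factor in (7.91) behaves as $n \to \infty$. Choose $0 < \epsilon < 1$ and let $\eta$ be such that

$$1 - \epsilon \leq \frac{1-\eta}{(1+\eta)^{k/2}}, \qquad 1 + \epsilon \geq \frac{1+\eta}{(1-\eta)^{k/2}}.$$

Since the prior is continuous at $\theta_0$, there exists $\delta_1 > 0$ such that $\|\theta - \theta_0\| < \delta_1$ implies $|f_\Theta(\theta) - f_\Theta(\theta_0)| < \eta f_\Theta(\theta_0)$. By general regularity condition 7, there exists $\delta_2 > 0$ such that

$$\lim_{n\to\infty} P'_{\theta_0}\left(\sup_{\theta\in N_0(\delta_2),\,\|\gamma\|=1}\left|1 + \gamma^\top\Sigma_n^{1/2}\ell_n''(\theta)\Sigma_n^{1/2}\gamma\right| < \eta\right) = 1. \tag{7.92}$$

Let $\delta = \min\{\delta_1, \delta_2\}$. Write $f_{X_n}(X_n) = J_1 + J_2$, where

$$J_1 = \int_{N_0(\delta)} f_\Theta(\theta) f_{X_n|\Theta}(X_n|\theta)\,d\theta, \qquad J_2 = \int_{\Omega\setminus N_0(\delta)} f_\Theta(\theta) f_{X_n|\Theta}(X_n|\theta)\,d\theta.$$

Use (7.90) to write

$$J_1 = f_{X_n|\Theta}(X_n|\hat\Theta_n)\int_{N_0(\delta)} f_\Theta(\theta)\exp\left\{-\tfrac{1}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1/2}\left(I_k - R_n(\theta, X_n)\right)\Sigma_n^{-1/2}(\theta - \hat\Theta_n) + \Delta_n\right\}d\theta.$$

Because $\delta \leq \delta_1$, it follows that

$$(1-\eta)J_3 < \frac{J_1}{f_\Theta(\theta_0) f_{X_n|\Theta}(X_n|\hat\Theta_n)} < (1+\eta)J_3, \tag{7.93}$$

where

$$J_3 = \int_{N_0(\delta)} \exp\left\{-\tfrac{1}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1/2}\left(I_k - R_n(\theta, X_n)\right)\Sigma_n^{-1/2}(\theta - \hat\Theta_n) + \Delta_n\right\}d\theta.$$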



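It follows from (7.92) and the consistency of $\hat\Theta_n$ that the limit as $n \to \infty$ of $P'_{\theta_0}$ of the intersection of $\{\Delta_n = 0\}$ with the following event is 1:

$$\int_{N_0(\delta)}\exp\left\{-\frac{1+\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right\}d\theta \leq J_3 \leq \int_{N_0(\delta)}\exp\left\{-\frac{1-\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right\}d\theta.$$

We can write the two integrals that bound $J_3$ above as

$$\int_{N_0(\delta)}\exp\left\{-\frac{1\pm\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right\}d\theta = (2\pi)^{k/2}(1\pm\eta)^{-k/2}|\Sigma_n|^{1/2}\Phi(C_n),$$

where $\Phi(C_n)$ is the probability that an $N_k(0, I_k)$ vector is in $C_n$, and

$$C_n = \{t : \hat\Theta_n + (1\pm\eta)^{-1/2}\Sigma_n^{1/2}t \in N_0(\delta)\}.$$

By general regularity condition 5, $\Sigma_n^{1/2}t = o_P(1)$ for all $t$, so $\Phi(C_n) \stackrel{P}{\to} 1$. Hence,

$$\lim_{n\to\infty} P'_{\theta_0}\left(\frac{(2\pi)^{k/2}|\Sigma_n|^{1/2}}{(1+\eta)^{k/2}} < J_3 < \frac{(2\pi)^{k/2}|\Sigma_n|^{1/2}}{(1-\eta)^{k/2}}\right) = 1. \tag{7.94}$$

By the way we chose $\eta$ related to $\epsilon$, we get from (7.93) and (7.94) that

$$\lim_{n\to\infty} P'_{\theta_0}\left((2\pi)^{k/2}|\Sigma_n|^{1/2}(1-\epsilon) < \frac{J_1}{f_\Theta(\theta_0) f_{X_n|\Theta}(X_n|\hat\Theta_n)} < (2\pi)^{k/2}|\Sigma_n|^{1/2}(1+\epsilon)\right) = 1.$$

In other words,

$$\frac{J_1}{|\Sigma_n|^{1/2} f_{X_n|\Theta}(X_n|\hat\Theta_n)} \stackrel{P}{\to} (2\pi)^{k/2} f_\Theta(\theta_0). \tag{7.95}$$

Next, we show that

$$\frac{J_2}{|\Sigma_n|^{1/2} f_{X_n|\Theta}(X_n|\hat\Theta_n)} \stackrel{P}{\to} 0. \tag{7.96}$$

Using (7.90), we can write

$$J_2 = f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp[\ell_n(\theta_0) - \ell_n(\hat\Theta_n)]\int_{\Omega\setminus N_0(\delta)} f_\Theta(\theta)\exp[\ell_n(\theta) - \ell_n(\theta_0)]\,d\theta. \tag{7.97}$$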

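Now, refer to general regularity condition 6. Since $\lambda_n \leq |\Sigma_n|^{1/k}$, if $\theta \notin N_0(\delta)$, then $\ell_n(\theta) - \ell_n(\theta_0) < -|\Sigma_n|^{-1/k}K(\delta)$ with probability tending to 1. Hence, the integral on the right-hand side of (7.97) is less than

$$\exp[-|\Sigma_n|^{-1/k}K(\delta)]\int_{\Omega\setminus N_0(\delta)} f_\Theta(\theta)\,d\theta \leq \exp\left(-|\Sigma_n|^{-1/k}K(\delta)\right),$$

with probability tending to 1. Since $\hat\Theta_n$ is an MLE, $\exp[\ell_n(\theta_0) - \ell_n(\hat\Theta_n)] \leq 1$, and general regularity condition 5 says

$$\frac{\exp[-|\Sigma_n|^{-1/k}K(\delta)]}{|\Sigma_n|^{1/2}} \stackrel{P}{\to} 0.$$

So (7.96) holds. Combining (7.95) with (7.96), we get

$$\frac{f_{X_n}(X_n)}{|\Sigma_n|^{1/2} f_{X_n|\Theta}(X_n|\hat\Theta_n)} \stackrel{P}{\to} (2\pi)^{k/2} f_\Theta(\theta_0). \tag{7.98}$$

Since $\hat\Theta_n$ is consistent, and the prior is continuous at $\theta_0$, we have that $f_\Theta(\Sigma_n^{1/2}\psi + \hat\Theta_n) \stackrel{P}{\to} f_\Theta(\theta_0)$ uniformly for $\psi$ in a compact set. It follows that

$$\frac{|\Sigma_n|^{1/2} f_{X_n|\Theta}(X_n|\hat\Theta_n)\, f_\Theta(\Sigma_n^{1/2}\psi + \hat\Theta_n)}{f_{X_n}(X_n)} \stackrel{P}{\to} (2\pi)^{-k/2}$$

uniformly on compact sets.

To complete the proof, we need to show that the second fraction in (7.91) converges in probability to $\exp(-\|\psi\|^2/2)$ uniformly on compact sets. From (7.90), we get

$$\frac{f_{X_n|\Theta}(X_n|\Sigma_n^{1/2}\psi + \hat\Theta_n)}{f_{X_n|\Theta}(X_n|\hat\Theta_n)} = \exp\left\{-\tfrac{1}{2}\psi^\top\left(I_k - R_n(\Sigma_n^{1/2}\psi + \hat\Theta_n, X_n)\right)\psi + \Delta_n\right\}.$$

Let $\eta, \epsilon > 0$, and let $B$ be a compact subset of $\mathbb{R}^k$. Let $b$ be a bound on $\|\psi\|^2$ for $\psi \in B$. By general regularity condition 7, there exist $\delta$ and $M$ such that $n \geq M$ implies

$$P'_{\theta_0}\left(\sup_{\theta\in N_0(\delta),\,\|\gamma\|=1}\left|1 + \gamma^\top\Sigma_n^{1/2}\ell_n''(\theta)\Sigma_n^{1/2}\gamma\right| < \frac{\eta}{b}\right) > 1 - \frac{\epsilon}{2}.$$

Let $N \geq M$ be large enough so that $n \geq N$ implies

$$P'_{\theta_0}\left(\Sigma_n^{1/2}\psi + \hat\Theta_n \in N_0(\delta), \text{ for all } \psi \in B\right) > 1 - \frac{\epsilon}{2}.$$

Then, if $n \geq N$,

$$P'_{\theta_0}\left(\left|\psi^\top\left(I_k - R_n(\Sigma_n^{1/2}\psi + \hat\Theta_n, X_n)\right)\psi - \|\psi\|^2\right| < \eta \text{ for all } \psi \in B\right) > 1 - \epsilon.$$

Since $P'_{\theta_0}(\Delta_n = 0, \text{ for all } \psi) \to 1$, it follows that the second fraction in (7.91) is between $\exp(-\eta)\exp(-\|\psi\|^2/2)$ and $\exp(\eta)\exp(-\|\psi\|^2/2)$ with probability tending to 1, uniformly on compact sets. Since $\eta$ is arbitrary, the desired result follows. □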
We now give two examples in which $X_n$ does not consist of conditionally IID coordinates, but the general regularity conditions still hold.

Example 7.99. Let $\Omega$ be the interval $(-1, 1)$. Let $\{Z_n\}_{n=1}^\infty$ be IID $N(0, 1)$ and let $Y_0 = 0$. Define $Y_n = \theta Y_{n-1} + Z_n$ for $n = 1, 2, \ldots$. The sequence $\{Y_n\}_{n=1}^\infty$ is called a first-order autoregressive process. The $Y_i$ are clearly not conditionally IID given $\Theta = \theta$ except for the case $\theta = 0$. Let $X_n = (Y_1, \ldots, Y_n)$. Then $\ell_n(\theta)$ is a constant plus $-\left(Y_n^2 + (1 + \theta^2)\sum_{i=1}^{n-1} Y_i^2 - 2\theta\sum_{i=1}^n Y_iY_{i-1}\right)/2$. The MLE is easily calculated as $\hat\Theta_n = \sum_{i=1}^n Y_iY_{i-1}/\sum_{i=1}^n Y_{i-1}^2$. The first four general regularity conditions are trivially satisfied if the prior has a continuous density. Also, $\ell_n''(\theta) = -\sum_{i=1}^n Y_{i-1}^2$. Since $\ell_n''$ does not depend on $\theta$, general regularity condition 7 is satisfied. Since

$$\mathrm{Cov}_\theta(Y_i^2, Y_{i-k}^2) = \theta^{2k}\,\mathrm{Var}_\theta(Y_{i-k}^2) \leq \frac{2\theta^{2k}}{(1-\theta^2)^2},$$

it is easy to show that $\lim_{n\to\infty}\mathrm{Var}_\theta\left(\sum_{i=1}^n Y_{i-1}^2/n\right) = 0$. Hence $\Sigma_n \stackrel{P}{\to} 0$ and $n\Sigma_n$ converges in probability. Thus, general regularity condition 5 holds. Finally, note that, given $\Theta = \theta_0$, $\sum_{i=1}^n Y_iY_{i-1}/\sum_{i=1}^n Y_{i-1}^2 \stackrel{P}{\to} \theta_0$. Since

$$\ell_n(\theta) - \ell_n(\theta_0) = -\frac{\theta - \theta_0}{2}\left[(\theta + \theta_0)\sum_{i=1}^n Y_{i-1}^2 - 2\sum_{i=1}^n Y_iY_{i-1}\right],$$

it follows that general regularity condition 6 holds.
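
The quantities in Example 7.99 are easy to compute from a simulated path. The following is a minimal sketch, not part of the original example: the function names, the seed, and the choice $\theta_0 = 0.5$ are assumptions made for illustration. It prints the MLE, $\Sigma_n$, and $n\Sigma_n$ for increasing $n$, so that one can watch $\Sigma_n$ shrink while $n\Sigma_n$ settles down.

```python
import random

def simulate_ar1(theta, n, seed=0):
    """Return [Y_0, ..., Y_n] with Y_0 = 0 and Y_i = theta * Y_{i-1} + Z_i."""
    rng = random.Random(seed)          # hypothetical seed, for reproducibility only
    y = [0.0]
    for _ in range(n):
        y.append(theta * y[-1] + rng.gauss(0.0, 1.0))
    return y

def ar1_mle_and_sigma(y):
    """MLE sum(Y_i Y_{i-1}) / sum(Y_{i-1}^2) and Sigma_n = 1 / sum(Y_{i-1}^2)."""
    cross = sum(y[i] * y[i - 1] for i in range(1, len(y)))
    info = sum(y[i - 1] ** 2 for i in range(1, len(y)))   # equals -l_n''(theta)
    return cross / info, 1.0 / info

theta0 = 0.5                            # assumed true value for the illustration
for n in (100, 1000, 10000):
    mle, sigma_n = ar1_mle_and_sigma(simulate_ar1(theta0, n))
    print(n, round(mle, 4), sigma_n, n * sigma_n)
```

Because $\ell_n$ is exactly quadratic in $\theta$ here, the likelihood is proportional to a normal density in $\theta$ with mean $\hat\Theta_n$ and variance $\Sigma_n$ for every $n$.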

Example 7.100. Let $Y_1, Y_2, \ldots$ be conditionally independent given $\Theta = \theta$ with $Y_i \sim N(\theta, i)$. Let $X_n = (Y_1, \ldots, Y_n)$. The logarithm of the likelihood is a constant plus $\ell_n(\theta) = -\left(\sum_{i=1}^n \log(i) + \sum_{i=1}^n (Y_i - \theta)^2/i\right)/2$. The MLE is easily seen to be $\hat\Theta_n = \left(\sum_{i=1}^n Y_i/i\right)/\left(\sum_{i=1}^n 1/i\right)$. The second derivative of $\ell_n(\theta)$ is $-\sum_{i=1}^n 1/i$, which does not depend on $\theta$, so general regularity condition 7 holds. Since $\Sigma_n = 1/\sum_{i=1}^n (1/i) = O(1/\log(n))$, general regularity condition 5 holds. Since

$$\hat\Theta_n \sim N\left(\theta_0, \frac{1}{\sum_{i=1}^n 1/i}\right)$$

and

$$\lambda_n[\ell_n(\theta) - \ell_n(\theta_0)] = -(\theta - \theta_0)\left[\frac{\theta + \theta_0}{2} - \hat\Theta_n\right],$$

it follows that $\lambda_n[\ell_n(\theta) - \ell_n(\theta_0)] \stackrel{P}{\to} -(\theta - \theta_0)^2/2$, so general regularity condition 6 holds. The first four general regularity conditions hold if the prior is continuous. Note that, in this example, the MLE is not $\sqrt{n}$-consistent, but the posterior distribution is still asymptotically normal.
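
The slow, logarithmic rate in Example 7.100 can be seen numerically. The sketch below is an illustration only; the function name, the seed, and the value $\theta_0 = 2$ are assumptions. It computes the weighted-average MLE and $\Sigma_n = 1/\sum_{i=1}^n (1/i)$, and prints $\Sigma_n \log n$ to show that $\Sigma_n$ behaves like $1/\log n$ rather than $1/n$.

```python
import math
import random

def mle_and_sigma(ys):
    """ys[i-1] plays the role of Y_i ~ N(theta, i); weights are 1/Var(Y_i) = 1/i."""
    weights = [1.0 / i for i in range(1, len(ys) + 1)]
    info = sum(weights)                                   # equals -l_n''(theta)
    mle = sum(w * y for w, y in zip(weights, ys)) / info
    return mle, 1.0 / info

rng = random.Random(0)      # hypothetical seed
theta0 = 2.0                # assumed true value for the illustration
for n in (10, 1000, 100000):
    ys = [rng.gauss(theta0, math.sqrt(i)) for i in range(1, n + 1)]
    mle, sigma_n = mle_and_sigma(ys)
    print(n, round(mle, 3), round(sigma_n, 4), round(sigma_n * math.log(n), 4))
```

Even with 100000 observations the standard deviation $\Sigma_n^{1/2}$ is still roughly 0.3, which is why the MLE is not $\sqrt{n}$-consistent.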




7.4.2.2 Posterior Probabilities 

We have proven that the sequence of posterior densities of $\Psi_n$ converges in probability uniformly on compact sets to the $N_k(0, I_k)$ density. This makes it easy to conclude that posterior probabilities converge in probability as well.

Theorem 7.101. Assume the general regularity conditions, and let $\hat\Theta_n$ be an MLE of $\Theta$. Let $B \subseteq \mathbb{R}^k$ be a Borel set. Define $\ell_n$ as in (7.87), let $\Sigma_n$ be defined by (7.88), and let $\Psi_n = \Sigma_n^{-1/2}(\Theta - \hat\Theta_n)$. Then $\Pr(\Psi_n \in B|X_n) \stackrel{P}{\to} \Phi(B)$, under $P'_{\theta_0}$, where $\Phi(B)$ stands for the probability that an $N_k(0, I_k)$ vector lies in $B$.

Proof. First, suppose that $B$ is a subset of a compact set, and let $c$ be the Lebesgue measure of $B$. Then

$$|\Pr(\Psi_n \in B|X_n) - \Phi(B)| \leq \int_B \left|f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi)\right|d\psi \leq c\sup_{\psi\in B}\left|f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi)\right|.$$

This goes to 0 in probability by Theorem 7.89.

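Next, let $B$ be an arbitrary Borel set, and let $\epsilon > 0$ be given. Let $B_\epsilon$ be a compact set such that $\Phi(B_\epsilon) > 1 - \epsilon/3$. Let $N$ be large enough so that $n \geq N$ implies both of the following:

$$P'_{\theta_0}\left(|\Pr(\Psi_n \in B_\epsilon|X_n) - \Phi(B_\epsilon)| > \frac{\epsilon}{6}\right) < \frac{\epsilon}{2},$$

$$P'_{\theta_0}\left(|\Pr(\Psi_n \in B\cap B_\epsilon|X_n) - \Phi(B\cap B_\epsilon)| > \frac{\epsilon}{6}\right) < \frac{\epsilon}{2}.$$

We know that

$$|\Pr(\Psi_n \in B|X_n) - \Phi(B)| \leq |\Pr(\Psi_n \in B\cap B_\epsilon|X_n) - \Phi(B\cap B_\epsilon)| + \Pr(\Psi_n \in B_\epsilon^C|X_n) + \Phi(B_\epsilon^C).$$

So, if $n \geq N$,

$$P'_{\theta_0}\left(|\Pr(\Psi_n \in B|X_n) - \Phi(B)| > \epsilon\right) < \epsilon. \qquad □$$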
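
Theorem 7.101 can be checked numerically in a simple model. The sketch below is an illustration under stated assumptions, none of which come from the text: a Bernoulli likelihood with $s$ successes in $n$ trials, a uniform prior on $(0,1)$, and trapezoidal integration on a fixed grid. It computes $\Pr(-1 \leq \Psi_n \leq 1\,|\,X_n)$ with $\hat\Theta_n = s/n$ and $\Sigma_n = \hat\Theta_n(1-\hat\Theta_n)/n$, and compares it with $\Phi([-1,1])$.

```python
import math

def posterior_prob_standardized(s, n, a, b, grid=20001):
    """Pr(a <= Psi_n <= b | data) for Bernoulli data under a uniform prior (assumed)."""
    theta_hat = s / n
    sigma = math.sqrt(theta_hat * (1.0 - theta_hat) / n)   # Sigma_n^{1/2}

    def log_post(t):                                       # log posterior up to a constant
        return s * math.log(t) + (n - s) * math.log(1.0 - t)

    peak = log_post(theta_hat)                             # subtract to avoid underflow

    def integrate(lo, hi):                                 # trapezoidal rule on [lo, hi]
        lo, hi = max(lo, 1e-12), min(hi, 1.0 - 1e-12)
        h = (hi - lo) / (grid - 1)
        vals = [math.exp(log_post(lo + i * h) - peak) for i in range(grid)]
        return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

    numerator = integrate(theta_hat + a * sigma, theta_hat + b * sigma)
    # mass more than 12 posterior standard deviations away is treated as negligible
    denominator = integrate(theta_hat - 12 * sigma, theta_hat + 12 * sigma)
    return numerator / denominator

normal_prob = math.erf(1.0 / math.sqrt(2.0))               # Phi([-1, 1])
for n in (10, 100, 10000):
    p = posterior_prob_standardized(round(0.3 * n), n, -1.0, 1.0)
    print(n, round(p, 4), round(normal_prob, 4))
```

The data are held at a fixed success fraction of 0.3 rather than simulated, so the comparison isolates the effect of $n$ on the quality of the normal approximation.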

7.4.2.3 Conditionally IID Random Quantities 

Walker (1969) proves a result like Theorem 7.101 in the case of conditionally 
IID random quantities. He gives a long list of regularity conditions and then 
proves that they imply the general regularity conditions. These conditions 
are very similar to those of the Cramer-Rao inequality and those used to 
prove asymptotic normality of MLEs. Rather than repeat those conditions 
here, we will use the conditions already stated elsewhere in this book. 






Theorem 7.102. Let $\{Y_n\}_{n=1}^\infty$ be conditionally IID given $\Theta$, and let $X_n = (Y_1, \ldots, Y_n)$. Suppose that (7.64) and the conditions of either Theorem 7.49 or Lemma 7.54 30 hold for $\{Y_n\}_{n=1}^\infty$, and that the first four general regularity conditions hold. Also, suppose that the Fisher information $\mathcal{I}_{X_1}(\theta_0)$ is positive definite. Then the remaining general regularity conditions hold.

Proof. That general regularity condition 5 holds follows from the fact that $n\Sigma_n \stackrel{P}{\to} \mathcal{I}_{X_1}(\theta_0)^{-1}$. For general regularity condition 6, let $\delta > 0$ and let $Z(\cdot,\cdot)$ be as in Theorem 7.49. Then

$$\begin{aligned} \sup_{\theta\in\Omega\setminus N_0(\delta)}[\ell_n(\theta) - \ell_n(\theta_0)] &= -\inf_{\theta\in\Omega\setminus N_0(\delta)}[\ell_n(\theta_0) - \ell_n(\theta)] \\ &\leq -\min\left\{\inf_{\theta\in\Omega_1}[\ell_n(\theta_0) - \ell_n(\theta)], \ldots, \inf_{\theta\in\Omega_m}[\ell_n(\theta_0) - \ell_n(\theta)]\right\} \\ &\leq -\min\left\{\sum_{i=1}^n Z(\Omega_1, X_i), \ldots, \sum_{i=1}^n Z(\Omega_m, X_i)\right\}, \end{aligned} \tag{7.103}$$

where $\Omega_1, \ldots, \Omega_m$ are as in the proof of Theorem 7.49. Since

$$\frac{1}{n}\sum_{i=1}^n Z(\Omega_j, X_i) \stackrel{P}{\to} \mathrm{E}_{\theta_0}Z(\Omega_j, X_i)$$

and these means are all positive for $j = 1, \ldots, m$, it follows that if

$$K(\delta) < \lambda\min\left\{\mathrm{E}_{\theta_0}Z(\Omega_1, X_i), \ldots, \mathrm{E}_{\theta_0}Z(\Omega_m, X_i)\right\},$$

where $\lambda$ is the smallest eigenvalue of $\mathcal{I}_{X_1}(\theta_0)^{-1}$, then general regularity condition 6 holds. For general regularity condition 7, let $\epsilon > 0$ and let $\delta$ be small enough so that $\mathrm{E}_{\theta_0}H_\delta(Y_i, \theta_0) < \epsilon/(\mu + \epsilon)$, where $H_\delta$ comes from (7.64) and $\mu$ is the largest eigenvalue of $\mathcal{I}_{X_1}(\theta_0)^{-1}$. Let $\mu_n$ stand for the largest eigenvalue of $\Sigma_n$. For $\theta \in N_0(\delta)$, we have

$$\begin{aligned} \sup_{\|\gamma\|=1}\left|1 + \gamma^\top\Sigma_n^{1/2}\ell_n''(\theta)\Sigma_n^{1/2}\gamma\right| &= \sup_{\|\gamma\|=1}\left|\gamma^\top\Sigma_n^{1/2}[\Sigma_n^{-1} + \ell_n''(\theta)]\Sigma_n^{1/2}\gamma\right| \\ &\leq \mu_n\sup_{\|\gamma\|=1}\left|\gamma^\top[\Sigma_n^{-1} + \ell_n''(\theta)]\gamma\right| \\ &\leq \mu_n\left(\sup_{\|\gamma\|=1}\left|\gamma^\top[\Sigma_n^{-1} + \ell_n''(\theta_0)]\gamma\right| + \sup_{\|\gamma\|=1}\left|\gamma^\top[\ell_n''(\theta) - \ell_n''(\theta_0)]\gamma\right|\right). \end{aligned}$$

If $\hat\Theta_n \in N_0(\delta)$ and $|\mu - n\mu_n| < \epsilon$, it follows from (7.64) that the last expression above is no greater than



By the weak law of large numbers B.95 and our choice of $\delta$, this last expression converges in probability to something no greater than $\epsilon$. This implies general regularity condition 7. □

To make the theorems of this section apply to prior probabilities not conditional on the parameter, suppose that the prior distribution satisfies $\Pr(\Theta \in \mathrm{int}(\Omega)) = 1$. We can now apply the result from Problem 6 on page 468 to conclude that the prior probability that the posterior after $n$ observations will be within $\epsilon$ of the normal approximation goes to 1 as $n$ goes to infinity.

30 Note that the condition of Lemma 7.83 could be used instead.

Example 7.104. Suppose that $X_1, \ldots, X_{10}$ are conditionally IID given $\Theta = \theta$ with $\mathrm{Cau}(\theta, 1)$ distribution. Suppose that the observations are

-5, -3, 0, 2, 4, 5, 7, 9, 11, 14.

Then the MLE of $\Theta$ is $\hat\theta_{10} = 4.531$, and $\ell_{10}''(4.531) = -1.23116$. So, Theorems 7.101 and 7.102 suggest that $N(4.531, 0.813)$ is approximately the distribution of $\Theta$. To see how good this approximation is, look at Figure 7.105. The solid line is an approximation to the posterior by numerically integrating the likelihood times the prior (trapezoidal rule from $\theta = -1$ to $\theta = 11$). The dotted line is the normal approximation. The functions are not particularly similar. For example, the normal approximation to $\Pr(\Theta > 5|X = x)$ is 0.3015, while the numerical integral under the posterior curve is 0.3560.

[Figure: posterior density plotted against $\theta$; solid curve "Numerical Int.", dotted curve "Normal Appx."]

Figure 7.105. Posterior Density for $\Theta$ in Cauchy Example




7.4.2.4 Loss of Information* 

For the case of conditionally IID random quantities, we can try to answer the question "How much information do we lose by not knowing $\Theta$?" The Kullback-Leibler information is $\mathrm{E}_{\theta_0}\log[f_{X_n|\Theta}(X_n|\theta_0)/f_{X_n}(X_n)]$ for comparing the distribution of $X_n$ given $\Theta = \theta_0$ to the prior predictive distribution. Alternatively, we can examine $\log[f_{X_n|\Theta}(X_n|\theta_0)/f_{X_n}(X_n)]$ and see how it behaves for large $n$.

Theorem 7.106. Assume the conditions of Theorem 7.102. Then, given $\Theta = \theta_0$,

$$-2\log\frac{f_{X_n|\Theta}(X_n|\theta_0)}{f_{X_n}(X_n)} + k\log\left(\frac{n}{2\pi}\right) - 2\log f_\Theta(\theta_0) + \log|\mathcal{I}_{X_1}(\theta_0)| \stackrel{\mathcal{D}}{\to} \chi^2_k, \tag{7.107}$$

where $\mathcal{I}_{X_1}(\theta_0)$ is the Fisher information matrix for one observation.

Proof. We have assumed enough conditions to conclude that the general regularity conditions hold and that $\Sigma_n^{-1/2}(\hat\Theta_n - \theta_0) \stackrel{\mathcal{D}}{\to} N_k(0, I_k)$, given $\Theta = \theta_0$. Hence, we can use whatever steps we wish from the proof of Theorem 7.89. As in (7.90), we can write

$$\log f_{X_n|\Theta}(X_n|\theta_0) = \log f_{X_n|\Theta}(X_n|\hat\Theta_n) - \tfrac{1}{2}(\theta_0 - \hat\Theta_n)^\top\Sigma_n^{-1/2}\left(I_k - R_n(\theta_0, X_n)\right)\Sigma_n^{-1/2}(\theta_0 - \hat\Theta_n) + \Delta_n.$$

Using (7.92) and the consistency of the MLE, we conclude that

$$\frac{\log f_{X_n|\Theta}(X_n|\theta_0) - \log f_{X_n|\Theta}(X_n|\hat\Theta_n)}{-\tfrac{1}{2}(\theta_0 - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta_0 - \hat\Theta_n)} = 1 + o_P(1),$$

given $\Theta = \theta_0$. Since the denominator of this expression converges in distribution to $-.5$ times $\chi^2_k$, it follows that the numerator does also. It follows from (7.98) that

$$\log f_{X_n}(X_n) - \log f_{X_n|\Theta}(X_n|\hat\Theta_n) - \tfrac{1}{2}\log|\Sigma_n| \stackrel{P}{\to} \tfrac{k}{2}\log(2\pi) + \log f_\Theta(\theta_0).$$

Since the determinant is a continuous function of a matrix, and $n\Sigma_n \stackrel{P}{\to} \mathcal{I}_{X_1}(\theta_0)^{-1}$, we get

$$\log f_{X_n}(X_n) - \log f_{X_n|\Theta}(X_n|\hat\Theta_n) + \tfrac{k}{2}\log n \stackrel{P}{\to} \tfrac{k}{2}\log(2\pi) + \log f_\Theta(\theta_0) - \tfrac{1}{2}\log|\mathcal{I}_{X_1}(\theta_0)|.$$

*This section may be skipped without interrupting the flow of ideas.




The conclusion now follows. □ 
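
The limit in (7.107) can be checked by simulation in a model where the prior predictive density has a closed form. The sketch below is an illustration under assumptions that are not in the text: $X_i \sim N(\theta, 1)$ with a $N(0,1)$ prior, so $k = 1$ and $\mathcal{I}_{X_1}(\theta) = 1$. In that model $\log f_{X_n|\Theta}(X_n|\theta_0) - \log f_{X_n}(X_n) = -n(\bar{X} - \theta_0)^2/2 + \tfrac{1}{2}\log(n+1) + n\bar{X}^2/(2(n+1))$, which depends on the data only through $\bar{X}$, so $\bar{X} \sim N(\theta_0, 1/n)$ is drawn directly.

```python
import math
import random

def statistic(xbar, n, theta0):
    """Left-hand side of (7.107) for N(theta, 1) data and an assumed N(0, 1) prior."""
    log_ratio = (-0.5 * n * (xbar - theta0) ** 2          # log f(X|theta0) - log f_X(X)
                 + 0.5 * math.log(n + 1)
                 + n * xbar ** 2 / (2.0 * (n + 1)))
    log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * theta0 ** 2
    fisher = 1.0                                          # Fisher information, k = 1
    return (-2.0 * log_ratio + math.log(n / (2.0 * math.pi))
            - 2.0 * log_prior + math.log(fisher))

rng = random.Random(0)                    # hypothetical seed
n, theta0, reps = 10000, 0.5, 4000        # assumed settings for the illustration
draws = [statistic(rng.gauss(theta0, 1.0 / math.sqrt(n)), n, theta0)
         for _ in range(reps)]
mean_t = sum(draws) / reps
tail_t = sum(d > 3.841 for d in draws) / reps   # 3.841 is the 0.95 quantile of chi^2_1
print(round(mean_t, 3), round(tail_t, 3))
```

A $\chi^2_1$ limit predicts a sample mean near 1 and a fraction near 0.05 above 3.841.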
Theorem 7.106 says that the amount of Kullback-Leibler information lost for not knowing $\Theta$, when observing $n$ conditionally IID random variables, tends to be about $k\log(n)/2$, for all continuous priors. The effect of the prior distribution is of a lower order of magnitude. Notice that choosing $f_\Theta(\theta) = |\mathcal{I}_{X_1}(\theta)|^{1/2}$ makes the last two terms on the left-hand side of (7.107) cancel. Suppose that $|\mathcal{I}_{X_1}(\theta)|^{1/2}$ is integrable with respect to Lebesgue measure. In that case, $c|\mathcal{I}_{X_1}(\theta)|^{1/2}$ would then be Jeffreys' prior (for some $c$) as described in Section 2.3.4. Note also that

$$\int \log\frac{f_\Theta(\theta)}{c|\mathcal{I}_{X_1}(\theta)|^{1/2}}\, f_\Theta(\theta)\,d\theta \geq 0,$$

for every prior $f_\Theta$, with equality only when $f_\Theta$ is Jeffreys' prior. It follows that if Jeffreys' prior describes someone's beliefs, then that person believes that his or her predictive density for the data $f_{X_n}(X_n)$ will (asymptotically) be smaller, relative to $f_{X_n|\Theta}(X_n|\theta_0)$, than that of someone who believes any other prior. An alternative way to say this is that a person believing Jeffreys' prior thinks that he or she has more to learn about $\Theta$ from the data than does someone believing a different prior. This informal description can be made more rigorous, as Clarke and Barron (1994) do. They consider a decision problem in which the action space is the set of continuous prior densities, and the loss is

$$L(\theta, f_\Theta) = \mathrm{E}_\theta\log\frac{f_{X_n|\Theta}(X_n|\theta)}{\int f_{X_n|\Theta}(X_n|\psi) f_\Theta(\psi)\,d\psi}.$$

Note that this loss is precisely the Kullback-Leibler information for comparing the distribution of $X_n$ given $\Theta = \theta$ to the prior predictive distribution. They show that Jeffreys' prior is asymptotically least favorable in this decision problem.

7.4.3 Laplace Approximations to Posterior Distributions* 

In Section 7.4.2, we calculated the asymptotic distribution, given $\Theta = \theta$, of the integral of the posterior density over some set. Sometimes, one is only interested in the value of such an integral. The method of Laplace gives us a way to calculate approximations to such integrals together with an order of magnitude of the error. This discussion is a hybrid of the papers by Tierney and Kadane (1986) and Kass, Tierney, and Kadane (1990).

Suppose that we are interested in the posterior mean of a positive function $g$ of $\Theta$. For example, if $g(\theta) = f_{X_1|\Theta}(y|\theta)$ for some fixed value $y$, then $\mathrm{E}(g(\Theta)|X = x)$ is the predictive density of a future observation. In general, we can write

$$\mathrm{E}(g(\Theta)|X = x) = \frac{\int g(\theta) f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta}{\int f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta}.$$

*This section may be skipped without interrupting the flow of ideas.

The method of Laplace provides approximations to each of these integrals 
for specific values of x. Some conditions and notation are needed to state 
the approximations precisely. 

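Theorem 7.108. For each $n$, let $(\mathcal{X}_n, \mathcal{B}_n)$ be a Borel space, and let $X_n$ be a random quantity taking values in $\mathcal{X}_n$. Let $X^n = (X_1, \ldots, X_n)$ and let $(\mathcal{X}^n, \mathcal{B}^n)$ be the product space of $\mathcal{X}_1, \ldots, \mathcal{X}_n$. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family of distributions for $\{X_n\}_{n=1}^\infty$ with $\Omega \subseteq \mathbb{R}$. Suppose that the distribution of $X^n$ given $\Theta = \theta$ is absolutely continuous with respect to a measure $\nu^n$ on $(\mathcal{X}^n, \mathcal{B}^n)$ for all $n$ with density $f_{X^n|\Theta}(\cdot|\theta)$. Let $g : \Omega \to \mathbb{R}^+$ be a function. Let $f_\Theta(\theta)$ be the prior density of $\Theta$ with respect to Lebesgue measure. Assume that $f_{X^n|\Theta}(x^n|\theta)$ for all $n$ and $x^n \in \mathcal{X}^n$, $g(\theta)$, and $f_\Theta(\theta)$ are all continuously differentiable with respect to $\theta$ six times. Assume that $\int g(\theta) f_\Theta(\theta)\,d\theta < \infty$. Define

$$\begin{aligned} \ell_n(\theta; x^n) &= \log f_{X^n|\Theta}(x^n|\theta), \\ H_n(\theta; x^n) &= \frac{1}{n}\left[\ell_n(\theta; x^n) + \log f_\Theta(\theta)\right], \\ H_n^*(\theta; x^n) &= H_n(\theta; x^n) + \frac{1}{n}\log g(\theta). \end{aligned}$$

Now let $\mathcal{Y} = \prod_{n=1}^\infty \mathcal{X}_n$ and define the set $A \subseteq \mathcal{Y}$ as the set of all $x = (x_1, x_2, x_3, \ldots) \in \mathcal{Y}$ with the following properties:

• The integrals $\int g(\theta) f_\Theta(\theta) f_{X^n|\Theta}(x^n|\theta)\,d\theta$ and $\int f_\Theta(\theta) f_{X^n|\Theta}(x^n|\theta)\,d\theta$ are finite for all $n$.

• $\ell_n$ achieves its maximum at a point $\theta_n'(x^n)$ for each $n$.

• For each $n$, $H_n$ and $H_n^*$ achieve their maxima at points $\hat\theta_n(x^n)$ and $\theta_n^*(x^n)$, respectively, where the first derivatives are zero.

• $\hat\theta_n(x^n)$ and $\theta_n^*(x^n)$ converge as $n \to \infty$.

• The second derivatives of $H_n$ and $H_n^*$ at their maxima converge to negative numbers.

• There exist $\delta_0, N_0, M > 0$ such that for all $n \geq N_0$, the absolute values of $H_n$, $H_n^*$ and their first six derivatives are all bounded by $M$ for $|\theta - \hat\theta_n(x^n)| < 2\delta_0$.

• For every $\delta > 0$,

$$\limsup_{n\to\infty}\frac{1}{n}\sup_{|\theta - \theta_n'(x^n)| > \delta}\left[\ell_n(\theta; x^n) - \ell_n(\theta_n'(x^n); x^n)\right] < 0. \tag{7.109}$$

For each $(x_1, x_2, x_3, \ldots) \in A$, define

$$\sigma^2(x^n) = -\frac{1}{H_n''(\hat\theta_n(x^n); x^n)}, \qquad \sigma^{*2}(x^n) = -\frac{1}{H_n^{*\prime\prime}(\theta_n^*(x^n); x^n)}.$$

For each $(x_1, x_2, x_3, \ldots) \in A$, $\mathrm{E}(g(\Theta)|X^n = x^n)$ equals

$$\frac{\sigma^*}{\sigma}\exp\left(n\left[H_n^*(\theta_n^*(x^n); x^n) - H_n(\hat\theta_n(x^n); x^n)\right]\right)\left[1 + O(n^{-2})\right].$$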

Proof. Since virtually everything depends on n and x n in the statement 
of the theorem, we will simplify the notation by not explicitly expressing 
that dependence. For example, H0) will stand for H n 0 n (x n );x n ). Now, 
let a?2, ...) € A and write 

E( ff (6)|X» = *») = f ?^^^ - (7.110) 

We assumed that 0 and 0* both converge. We now show that they con- 
verge to the same thing and that so does 0'. Suppose that 0 converges to 
0o and 0* converges to 9\. If there exists 6 > 0 such that \ff - 9\ > 8 
for infinitely many n, then (7.109) says that there is 77 > 0 such that 
10) < t{9 f ) - 77 infinitely often. Since H0) = 10) + 0(n _1 ), and simi- 
larly for 0', it follows that H0) < H(0') - 77, infinitely often. This con- 
tradicts the definition of 6 being the location of the maximum of H for 
all n. It follows that, for each 6 > 0, there exists N such that n > N 
implies \9 f - 0| < 6. Hence 0' converges to 0o also. A similar argument 
shows that 9' converges to 0i, hence 9q = 9\. Let £t' = n (0o — <$o,0o + 
tf 0 ). Since exp(nif(0)) = exp(£(0))/ e (0), condition (7.109) implies that 
fn\n f exp(n#(0))d0 and J Q ^ Qf exp(nH* (9))d9 are exponentially small. For 
this reason, we can replace Q by Q,' in (7.110) without incurring an error 
larger than 0(n~~ 2 ). 

If we expand #(0) in a Taylor series (see Theorem C.l) around 0 = 0, 
we get 

H(9) = H0) + (0 - 0)#'(0) 4- ~(0 - 0) 2 #"(0) 4- ~(0 - 0) 3 #"'(0) 

+ - 0) 4 tf^(0) + ^(0 - 0) 5 // (v >(0) + ^(0 - 0) 6 # (vi) (0), 

where 0 is between 0 and 0, and H^ iv \ H^ v \ and respectively stand 
for the fourth, fifth, and sixth derivatives of H. Use the Taylor series of 
exp(x) around x = 0 and the fact that H'0) = 0 to write 

exp(n//(0)) = exp(nH(0))exp(^(0~0) 2 H ,, (0)) 

x [1 + ^0- 9fH'"0) + ^(0 - 0) 4 # (iv) (0) 

+ JL( 9 - 9fH^\9) + ^(0 - 0) 6 ff"'(0) 2 + Rn(«)] , 
120 72 



7.4. Large Sample Properties of Posterior Distributions 449 



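where

$$\int_{\Omega'} R_n(\theta) \exp\left(\tfrac{n}{2}(\theta - \hat\theta)^2 H''(\hat\theta)\right) d\theta = O(n^{-2})$$

as $n \to \infty$, because $R_n(\theta)$ is bounded on the bounded set $\Omega'$. We can also
show that

$$\int_{\Omega'} (\theta - \hat\theta)^k \exp\left(-\frac{n}{2\hat\sigma^2}(\theta - \hat\theta)^2\right) d\theta = O(n^{-2}),$$

for all odd $k$. (In fact, these last integrals are exponentially small.) This
implies that

$$\int_\Omega \exp(nH(\theta))\,d\theta = \exp(nH(\hat\theta)) \left( \int \exp\left(-\frac{n}{2\hat\sigma^2}(\theta - \hat\theta)^2\right) \left[1 + \tfrac{n}{24}(\theta - \hat\theta)^4 H^{(iv)}(\hat\theta) + \tfrac{n^2}{72}(\theta - \hat\theta)^6 H'''(\hat\theta)^2\right] d\theta + O(n^{-2}) \right)$$

$$= \sqrt{2\pi}\,\frac{\hat\sigma}{\sqrt{n}} \exp(nH(\hat\theta)) \left[1 + \frac{\hat\sigma^4}{8n} H^{(iv)}(\hat\theta) + \frac{5\hat\sigma^6}{24n} H'''(\hat\theta)^2 + O(n^{-2})\right]. \tag{7.111}$$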

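A similar argument shows that

$$\int_\Omega \exp(nH^*(\theta))\,d\theta = \sqrt{2\pi}\,\frac{\sigma^*}{\sqrt{n}} \exp(nH^*(\theta^*)) \times \left[1 + \frac{\sigma^{*4}}{8n} H^{*(iv)}(\theta^*) + \frac{5\sigma^{*6}}{24n} H^{*\prime\prime\prime}(\theta^*)^2 + O(n^{-2})\right]. \tag{7.112}$$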

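Next, we prove that $\hat\theta$ and $\theta^*$ differ by $O(n^{-1})$. Since $H^*$ has 0 derivative
at $\theta^*$, we can write

$$0 = H^{*\prime}(\theta^*) = H'(\theta^*) + O(n^{-1}) = H'(\hat\theta) + (\theta^* - \hat\theta)H''(\hat\theta) + O(n^{-1}) + o(\hat\theta - \theta^*) = (\theta^* - \hat\theta)H''(\hat\theta) + O(n^{-1}) + o(\hat\theta - \theta^*).$$

It follows that $\theta^* - \hat\theta = O(n^{-1}) + o(\hat\theta - \theta^*)$. So, it follows that $\theta^* - \hat\theta = O(n^{-1})$.
It also follows that the $k$th derivative of $H^*$ at $\theta^*$ differs from the
$k$th derivative of $H$ at $\hat\theta$ by $O(n^{-1})$. In particular, $\sigma^{*2} = \hat\sigma^2 + O(n^{-1})$.
Now, take the ratio (7.112) divided by (7.111) to get that $E(g(\Theta) \mid X = x)$
equals

$$\frac{\sigma^*}{\hat\sigma} \exp(n[H^*(\theta^*) - H(\hat\theta)]) \, \frac{1 + \frac{\sigma^{*4}}{8n} H^{*(iv)}(\theta^*) + \frac{5\sigma^{*6}}{24n} H^{*\prime\prime\prime}(\theta^*)^2 + O(n^{-2})}{1 + \frac{\hat\sigma^4}{8n} H^{(iv)}(\hat\theta) + \frac{5\hat\sigma^6}{24n} H'''(\hat\theta)^2 + O(n^{-2})} = \frac{\sigma^*}{\hat\sigma} \exp(n[H^*(\theta^*) - H(\hat\theta)]) \left[1 + O(n^{-2})\right]. \qquad \Box$$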






Theorem 7.108 makes claims about the conditional distribution of $\Theta$
given $X^n = x^n$ for a sequence of $x^n$ values having certain properties
(namely being in the set $A$). Of course, we will only ever get to observe
the beginning of such a sequence. If our model says that the set A occurs 
in probability, that is, V(A), then we might feel comfortable believing that 
the unobserved tail of the sequence will continue to produce a point in A. 
All that is required (in addition to the conditions of Theorem 7.108) is that
the Fisher information from one observation be positive and finite for all $\theta$
and that the MLE is interior to $\Omega$ with probability tending to 1.

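Example 7.113. Let $\{X_n\}_{n=1}^\infty$ be a sequence of IID $N(\mu, \sigma^2)$ random variables
conditional on $(M, \Sigma, \Lambda) = (\mu, \sigma, \lambda)$. Let the prior for $M$ given $\Sigma = \sigma$, $\Lambda = \lambda$
be $N(\mu_0, \sigma^2/\lambda)$. Let $\Sigma$ and $\Lambda$ be independent with $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ and
$\Lambda \sim \mathrm{Exp}(c_0)$. Conditional on $\Lambda = \lambda_0$, this is precisely the same as the natural
conjugate prior distribution, which was given in Example 1.24 on page 14. Hence,
the conditional density of $X = (X_1, \ldots, X_n)$ given $\Lambda = \lambda$ is the same as the
marginal density of the data in that example. This is easily calculated as

$$f_{X|\Lambda}(x \mid \lambda) = \frac{\Gamma\left(\frac{a_0+n}{2}\right) b_0^{a_0/2} \sqrt{\lambda}}{\Gamma\left(\frac{a_0}{2}\right) \pi^{n/2} \sqrt{\lambda + n} \left(b_0 + w + \frac{n\lambda}{n+\lambda}(\bar{x} - \mu_0)^2\right)^{\frac{a_0+n}{2}}}.$$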

Suppose that we want to calculate the predictive density of a future observation
$Y$. Conditional on $\Lambda = \lambda$,

. { Xyo + nx bo+w+^(x-iio) 2 r l]^ 
0 y A + n ao +n L AJy 

So, for each value of $y$, we can let $g(\lambda)$ be the density of $Y$ at $y$ and apply
Laplace's method for many values of $y$.

As an example, suppose that we observe $n = 10$ observations with $\bar{x} = 14.7$
and $w = 52.2$. Suppose that the prior had $a_0 = 1 = b_0 = c_0$ and $\mu_0 = 10$. Then
$\hat\lambda = 0.1598$ provides the maximum of the function $H$. For each value of $y$ between
0 and 30, say, we can let $g(\lambda)$ be the $t_{11}$ density as described above, and we get
the plot in Figure 7.114. For comparison, Figure 7.114 includes the predictive
density that would have been obtained had $\Pr(\Lambda = 1) = 1$ been assumed. (The
prior mean of $\Lambda$ is 1.)

A naive alternative to the Laplace approximation is to use the MLE
$g(\hat\theta)$ to approximate the posterior mean. First, note that $\exp(nH^*(\theta^*)) =
g(\theta^*)\exp(nH(\theta^*))$. Since $\hat\theta - \theta^* = O(n^{-1})$, we get that $g(\theta^*) = g(\hat\theta) +
O(n^{-1})$ and (as in the proof) $\sigma^{*2} = \hat\sigma^2 + O(n^{-1})$. Combining these facts
together with the fact that $H(\theta^*) = H(\hat\theta) + O([\theta^* - \hat\theta]^2)$ we get that the
difference between $g(\hat\theta)$ and the Laplace approximation is $O(n^{-1})$. Since
the Laplace approximation differs from $E(g(\Theta) \mid X^n = x^n)$ by $O(n^{-2})$,
we get that $g(\hat\theta)$ differs from $E(g(\Theta) \mid X^n = x^n)$ by $O(n^{-1})$. So, the Laplace
approximation can be thought of as a higher-order correction to the use of
the MLE as an approximation to $E(g(\Theta) \mid X^n = x^n)$.
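The comparison between the plug-in value and the Laplace ratio can be checked numerically on a toy case. The sketch below is not from the text: it assumes a posterior density proportional to theta^(a-1) exp(-b theta) (a gamma density, whose mean a/b is known exactly) and takes g(theta) = theta, so that both maximizers and both second derivatives are available in closed form. The function names and the values a = 10, b = 2 are illustrative assumptions.

```python
import math

def laplace_ratio_mean(a, b):
    """Laplace ratio approximation to E(theta) when the posterior is
    proportional to theta**(a-1) * exp(-b*theta), with a > 1 (assumed toy model)."""
    # n*H(theta) = (a-1)*log(theta) - b*theta is maximized at theta_hat.
    theta_hat = (a - 1.0) / b
    nH_hat = (a - 1.0) * math.log(theta_hat) - b * theta_hat
    var_hat = theta_hat ** 2 / (a - 1.0)      # -1 / (second derivative of n*H)
    # n*H*(theta) = n*H(theta) + log g(theta), with g(theta) = theta.
    theta_star = a / b
    nH_star = a * math.log(theta_star) - b * theta_star
    var_star = theta_star ** 2 / a            # -1 / (second derivative of n*H*)
    # (sigma* / sigma_hat) * exp(n[H*(theta*) - H(theta_hat)])
    return math.sqrt(var_star / var_hat) * math.exp(nH_star - nH_hat)

def plug_in_mean(a, b):
    """Naive alternative: g evaluated at the maximizer of H."""
    return (a - 1.0) / b

a, b = 10.0, 2.0
exact = a / b
laplace = laplace_ratio_mean(a, b)
naive = plug_in_mean(a, b)
print(exact, laplace, naive)
```

In this toy case the plug-in value misses the exact mean by 0.5, while the Laplace ratio is off by a few thousandths, which is the kind of gap the orders $O(n^{-1})$ and $O(n^{-2})$ suggest.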



[Figure: two predictive density curves, labeled "Laplace" and "Λ = 1", plotted for y from 0 to 30.]

FIGURE 7.114. Predictive Density for Y in Example 7.113



The Laplace method is most useful in hierarchical models, which will be 
discussed in more detail in Chapter 8. An example in which the Laplace 
method does not work so well is Example 7.104. 

Example 7.115 (Continuation of Example 7.104; see page 444). We have observed
ten random variables with $\mathrm{Cau}(\theta, 1)$ distribution given $\Theta = \theta$. Suppose
that our prior distribution for $\Theta$ was very flat, say $N(0, 1000)$. We are now in
position to approximate the mean of any positive function of $\Theta$. For example,
for each $t$, we can approximate the mean of $\exp(t\Theta)$. This would be the moment
generating function of $\Theta$. For each $x$ we could approximate the mean of
$[\pi(1 + (x - \Theta)^2)]^{-1}$. This would be the predictive density of a future observation
at $x$. A serious problem arises with this data set. For some values of $t$ or $x$,
the associated function $H^*$ is not unimodal. This makes the use of the Laplace
approximation unsatisfactory.

Nevertheless, we are able to approximate the moment generating function for
small values of $t$ at least. This is done by using the function $g(\theta) = \exp(t\theta)$
for several different small values of $t$. Kass, Tierney, and Kadane (1988) suggest
using numerical derivatives of the moment generating function for approximating
moments of the parameter. For example, if we use $t = k \times 10^{-5}$ for $k = 0, 1, 2, 3, 4$,
we can approximate two derivatives of the moment generating function at 0.
Laplace's method uses $g(\theta) = \exp(t\theta)$ for the values of $t$ listed above and gives



    t               E(exp(tΘ)) − 1.0
    0.0             0.0
    1.0 × 10^-5     4.49078 × 10^-5
    2.0 × 10^-5     8.98176 × 10^-5
    3.0 × 10^-5     13.47294 × 10^-5
    4.0 × 10^-5     17.96432 × 10^-5



We can fit a quartic polynomial to these values, and its first two derivatives at
0 will approximate the first two moments. The fitted quartic is $-31677440x^4 +
2972.6x^3 + 9.85075x^2 + 4.49068x + 1.0$. The first derivative is 4.49068 and the
second is 19.7015. Unfortunately, the estimated variance would be negative, so
these moment estimates are not very good. The problem here is that, for some
values of $t$, the function $H^*$ is not far from being multimodal.

Using numerical integration (trapezoidal rule for $-20 < \theta < 40$ with 6000 intervals),
we approximate the mean to be 4.58994 and the variance to be 2.20456.
The moment generating function can also be approximated by numerical integration.
It is



    t               E(exp(tΘ)) − 1.0
    0.0             0.0
    1.0 × 10^-5     4.59005 × 10^-5
    2.0 × 10^-5     9.18034 × 10^-5
    3.0 × 10^-5     13.77086 × 10^-5
    4.0 × 10^-5     18.36161 × 10^-5



If we fit a quartic to this, we get

$$-226671x^4 + 38.17317x^3 + 11.63566x^2 + 4.58994x + 1.0,$$

from which we approximate the mean as 4.58994 and the variance as 2.20380.
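The step from a fitted polynomial to moment estimates is short enough to spell out. For a polynomial approximation $c_0 + c_1 t + c_2 t^2 + \cdots$ to the moment generating function, the first derivative at 0 is $c_1$ and the second is $2c_2$, so the variance estimate is $2c_2 - c_1^2$. The sketch below applies this to the linear and quadratic coefficients of the two quartics quoted in this example; the function name is an illustrative assumption.

```python
def moments_from_mgf_coefficients(c1, c2):
    """Mean and variance implied by the linear (c1) and quadratic (c2)
    coefficients of a polynomial approximation to an MGF near t = 0."""
    mean = c1                  # first derivative of the polynomial at 0
    second_moment = 2.0 * c2   # second derivative of the polynomial at 0
    return mean, second_moment - mean ** 2

# Coefficients of the quartic fitted to the Laplace values.
laplace_mean, laplace_var = moments_from_mgf_coefficients(4.49068, 9.85075)
# Coefficients of the quartic fitted to the numerical-integration values.
numint_mean, numint_var = moments_from_mgf_coefficients(4.58994, 11.63566)

print(laplace_mean, laplace_var)   # variance estimate comes out negative
print(numint_mean, numint_var)     # variance estimate close to 2.2038
```

The negative value in the first case is the arithmetic behind the remark that the Laplace-based moment estimates are not very good here.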

For multiparameter problems, the situation is much the same except that
second and higher derivatives are more complicated objects. For a function
$f$ of $k$ variables with $p$ derivatives and for points $x$ and $y$ in $\mathbb{R}^k$, we define

$$D^{(p)}(f; x, y) = \sum_{j_1=1}^k \cdots \sum_{j_p=1}^k \left( \prod_{s=1}^p y_{j_s} \right) \left. \frac{\partial^p}{\partial z_{j_1} \cdots \partial z_{j_p}} f(z) \right|_{z=x}.$$

This is the analogue to the $p$th derivative evaluated at $x$ times $y$ to the
power $p$. In particular, $D^{(2)}(f; x, y) = y^\top M y$, where $M$ is the matrix of
second partial derivatives of $f$ evaluated at $x$. All of the above reasoning
applies as well in the case of $k$-dimensional $\Theta$. For example,

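$$\exp(nH(\theta)) = \exp(nH(\hat\theta)) \exp\left(-\tfrac{n}{2}(\theta - \hat\theta)^\top \Sigma^{-1} (\theta - \hat\theta)\right) \Big[ R_n(\theta) + 1 + \tfrac{n}{6} D^{(3)}(H; \hat\theta, \theta - \hat\theta) + \tfrac{n}{24} D^{(4)}(H; \hat\theta, \theta - \hat\theta) + \tfrac{n}{120} D^{(5)}(H; \hat\theta, \theta - \hat\theta) + \tfrac{n^2}{72} \left[D^{(3)}(H; \hat\theta, \theta - \hat\theta)\right]^2 \Big],$$

where $\Sigma$ is minus the inverse of the matrix of second partials of $H$ evaluated
at $\hat\theta$ and

$$\int_{\Omega'} R_n(\theta) \exp\left(-\tfrac{n}{2}(\theta - \hat\theta)^\top \Sigma^{-1} (\theta - \hat\theta)\right) d\theta = O(n^{-2}),$$

as $n \to \infty$. The net effect of these modifications is that

$$E(g(\Theta) \mid X = x) = \frac{|\Sigma^*|^{1/2}}{|\Sigma|^{1/2}} \exp(n[H^*(\theta^*) - H(\hat\theta)]) \left[1 + O(n^{-2})\right],$$

where $\Sigma^*$ is minus the inverse of the matrix of second partials of $H^*$ evaluated
at $\theta^*$.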

One additional thing that we can do in the multiparameter case is approximate
marginal densities of subsets of the parameter vector. Such densities
are ratios of integrals of different dimensions, and we will not obtain
$O(n^{-2})$ approximations in this case.

Theorem 7.116. For each $n$, let $(\mathcal{X}_n, \mathcal{B}_n)$ be a Borel space, and let $X_n$
be a random quantity taking values in $\mathcal{X}_n$. Let $X^n = (X_1, \ldots, X_n)$ and let
$(\mathcal{X}^n, \mathcal{B}^n)$ be the product space of $\mathcal{X}_1, \ldots, \mathcal{X}_n$. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric
family of distributions for $\{X_n\}_{n=1}^\infty$ with $\Omega \subseteq \mathbb{R}^k$ with $k > 1$. Suppose
that the distribution of $X^n$ given $\Theta = \theta$ is absolutely continuous with respect
to a measure $\nu^n$ on $(\mathcal{X}^n, \mathcal{B}^n)$ for all $n$ with density $f_{X^n|\Theta}(\cdot \mid \theta)$. Let
$f_\Theta(\theta)$ be the prior density of $\Theta$ with respect to Lebesgue measure. Assume
that $f_{X^n|\Theta}(x^n \mid \theta)$, for all $n$ and $x^n \in \mathcal{X}^n$, and $f_\Theta(\theta)$ are all continuously
differentiable with respect to $\theta$ four times. Write a typical point $\theta \in \Omega$ as
$\theta = (\gamma, \psi)$, where $\gamma \in \mathbb{R}^p$ and $\psi \in \mathbb{R}^{k-p}$ with $1 \le p < k$. For each $\gamma$, let
$\Omega(\gamma) = \{\psi : (\gamma, \psi) \in \Omega\}$. Define

$$\ell_n(\theta; x^n) = \log f_{X^n|\Theta}(x^n \mid \theta),$$
$$H_n(\theta; x^n) = \frac{1}{n}\left[\ell_n(\theta; x^n) + \log f_\Theta(\theta)\right],$$
$$H_n^*(\psi; x^n, \gamma) = H_n((\gamma, \psi); x^n).$$

Now let $\mathcal{Y} = \prod_{n=1}^\infty \mathcal{X}_n$, and define the set $A \subseteq \mathcal{Y}$ as the set of all $x =
(x_1, x_2, x_3, \ldots) \in \mathcal{Y}$ with the following properties:

• The integrals $\int f_\Theta(\theta) f_{X^n|\Theta}(x^n \mid \theta)\,d\theta$ and $\int f_\Theta(\gamma, \psi) f_{X^n|\Theta}(x^n \mid \gamma, \psi)\,d\psi$
are finite for all $n$ and $\gamma$.

• $\ell_n$ achieves its maximum at a point $\theta'_n(x^n)$ for each $n$.

• For each $n$ and $\gamma$, $\ell_n((\gamma, \psi))$ achieves its maximum at a point that we
will call $\psi'_n(x^n; \gamma)$.

• For each $n$, $H_n$ achieves its maximum at a point $\hat\theta_n(x^n)$ where the
first derivative is zero.

• For each $n$ and $\gamma$, $H_n^*(\cdot; x^n, \gamma)$ achieves its maximum at $\psi_n^*(x^n; \gamma)$, a
point where the first derivative is zero.

• $\hat\theta_n(x^n)$ and $\psi_n^*(x^n; \gamma)$ converge as $n \to \infty$.

• The second derivatives of $H_n$ and $H_n^*$ at their maxima converge to
negative definite matrices.

• There exists $\delta_0$ such that the absolute values of $H_n$ and its first four
derivatives are all uniformly bounded for $\|\theta - \hat\theta_n(x^n)\| < 2\delta_0$. There
exists $\delta_1$ such that for all $\gamma$ the absolute values of $H_n^*$ and its first
four derivatives are all uniformly bounded for $\|\psi - \psi_n^*(x^n; \gamma)\| < 2\delta_1$.



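• $$\limsup_{n \to \infty} \frac{1}{n} \sup_{\|\theta - \theta'_n(x^n)\| > \delta} \left[\ell_n(\theta; x^n) - \ell_n(\theta'_n(x^n); x^n)\right] < 0, \quad \forall \delta > 0. \tag{7.117}$$

For each $(x_1, x_2, x_3, \ldots) \in A$, define

$$\hat\Sigma_n(x^n) = -\left[\left(\left( \left. \frac{\partial^2}{\partial\theta_i \partial\theta_j} H_n(\theta; x^n) \right|_{\theta = \hat\theta_n(x^n)} \right)\right)\right]^{-1}, \qquad \Sigma_n^*(x^n; \gamma) = -\left[\left(\left( \left. \frac{\partial^2}{\partial\psi_i \partial\psi_j} H_n^*(\psi; x^n, \gamma) \right|_{\psi = \psi_n^*(x^n; \gamma)} \right)\right)\right]^{-1}.$$

For each $(x_1, x_2, x_3, \ldots) \in A$, the marginal posterior density of $\Gamma$ (the first
$p$ coordinates of $\Theta$) given $X^n = x^n$ is $f_{\Gamma|X^n}(\gamma \mid x^n)$ equal to

$$\frac{n^{p/2}\, |\Sigma_n^*(x^n; \gamma)|^{1/2}}{(2\pi)^{p/2}\, |\hat\Sigma_n(x^n)|^{1/2}} \times \exp\left(n\left[H_n^*(\psi_n^*(x^n; \gamma); x^n, \gamma) - H_n(\hat\theta_n(x^n); x^n)\right]\right) \times \left[1 + O(n^{-1})\right].$$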

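Proof. As in the proof of Theorem 7.108, we will suppress the dependence
on $n$ and $x^n$. Let $(x_1, x_2, \ldots) \in A$, and write

$$f_{\Gamma|X^n}(\gamma \mid x^n) = \frac{\int_{\Omega(\gamma)} \exp(nH^*(\psi; \gamma))\,d\psi}{\int_\Omega \exp(nH(\theta))\,d\theta}. \tag{7.118}$$

As in the proof of Theorem 7.108, $\theta'$ converges to the same thing to
which $\hat\theta$ converges, and $\psi'(\gamma)$ converges to the same thing to which $\psi^*(\gamma)$
converges. Let $\Omega'$ be that part of $\Omega$ inside the ball of radius $\delta_0$ around
$\theta_0$. Since $\exp(nH(\theta)) = \exp(\ell_n(\theta)) f_\Theta(\theta)$, condition (7.117) implies that
$\int_{\Omega \setminus \Omega'} \exp(nH(\theta))\,d\theta$ is exponentially small. For this reason, we can replace
$\Omega$ by $\Omega'$, the part of $\Omega$ inside a ball of radius $\delta_0$ around the limit $\theta_0$ of $\hat\theta$.
We can also replace $\Omega(\gamma)$ by $\Omega'(\gamma)$, the part of $\Omega(\gamma)$ inside a ball of radius
$\delta_0$ around the limit of $\psi^*(\gamma)$. The error in (7.118) for doing this is no larger
than $O(n^{-1})$.

By expanding $H(\theta)$ in a Taylor series (see Theorem C.1) around $\theta = \hat\theta$,
as in the proof of Theorem 7.108, we can obtain

$$\int_{\Omega'} \exp(nH(\theta))\,d\theta = (2\pi)^{k/2}\, n^{-k/2}\, |\hat\Sigma|^{1/2} \exp(nH(\hat\theta)) \times \left[1 + O(n^{-1})\right].$$

Similarly,

$$\int_{\Omega'(\gamma)} \exp(nH^*(\psi; \gamma))\,d\psi = (2\pi)^{(k-p)/2}\, n^{-(k-p)/2}\, |\Sigma^*|^{1/2} \exp(nH^*(\psi^*(\gamma))) \times \left[1 + O(n^{-1})\right].$$

Taking the ratio of these two gives the result. □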






The proof of Theorem 7.116 can be adapted to show that the approximate
Bayes factor in (4.27) on page 227 is an $O(n^{-1})$ approximation to the true
Bayes factor when the hypothesis is $H : \Gamma = \gamma_0$. In this case, the parameter
under the hypothesis is $\Psi$, the last $k - p$ coordinates of $\Theta$. One must replace
$\theta'$ by $\hat\theta$ and $\psi'(\gamma)$ by $\psi^*(\gamma)$ in the approximation of Theorem 7.116, but
this does not alter the order of the approximation. One can also show that
the $O(n^{-1})$ term in Theorem 7.116 is uniform for $\gamma$ in compact sets.

7.4.4 Asymptotic Agreement of Predictive Distributions* 

The next theorem is due to Blackwell and Dubins (1962). It concerns the 
difference between posterior predictive distributions calculated under two 
different models. 31 If the two models are somewhat similar (in the sense 
that they at least assign positive probabilities to the same events), then 
the posterior probabilities (calculated from the two models) of every event 
will become uniformly closer as the amount of data increases. 

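To be precise, we will need to set up some notation. Let $(\mathcal{X}_n, \mathcal{B}_n)$ be
measurable spaces, let $\mathcal{X} = \prod_{n=1}^\infty \mathcal{X}_n$, and let $\mathcal{B}$ be the product σ-field,
$\mathcal{B} = \mathcal{B}_1 \otimes \mathcal{B}_2 \otimes \cdots$. Suppose that $P$ and $Q$ are probability measures on $(\mathcal{X}, \mathcal{B})$.
Let $P_n$ and $Q_n$ be the respective marginal distributions on $(\mathcal{Y}_n, \mathcal{C}_n)$, where
$\mathcal{Y}_n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ and $\mathcal{C}_n = \mathcal{B}_1 \otimes \cdots \otimes \mathcal{B}_n$. Let $P^n$ and $Q^n$ stand for versions
of the conditional distributions on $(\mathcal{Y}^n, \mathcal{C}^n)$ given the first $n$ coordinates,
where $\mathcal{Y}^n = \mathcal{X}_{n+1} \times \mathcal{X}_{n+2} \times \cdots$ and $\mathcal{C}^n = \mathcal{B}_{n+1} \otimes \mathcal{B}_{n+2} \otimes \cdots$. That is,
for each $B \in \mathcal{C}^n$, there is a $\mathcal{C}_n$ measurable function $P^n(B \mid x_1, \ldots, x_n)$ such
that, for each $(x_1, \ldots, x_n) \in \mathcal{Y}_n$, $P^n(\cdot \mid x_1, \ldots, x_n)$ is a probability measure
over $\mathcal{C}^n$ and for each bounded measurable $\phi : \mathcal{X} \to \mathbb{R}$,

$$\int \phi(x)\,dP(x) = \int \left[ \int \phi(x_1, x_2, \ldots)\,dP^n(x_{n+1}, \ldots \mid x_1, \ldots, x_n) \right] dP_n(x_1, \ldots, x_n)$$

(and similarly for $Q^n$). For arbitrary probability measures $S$ and $T$ on the
same σ-field $\mathcal{C}$, let

$$\rho(S, T) = \sup_{A \in \mathcal{C}} |S(A) - T(A)|.$$

Before we state and prove the main theorem, we will give conditions
under which the hypotheses of the theorem hold.

Lemma 7.119. Let $\pi_2 \ll \pi_1$ be probability measures on a parameter space
$(\Omega, \tau)$ with parametric family $\{P_\theta : \theta \in \Omega\}$. Suppose that for every $B \in \mathcal{B}$,

$$P(B) = \int_\Omega P_\theta(B)\,d\pi_1(\theta), \qquad Q(B) = \int_\Omega P_\theta(B)\,d\pi_2(\theta).$$

Then $Q \ll P$.

+ This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
31 The proof relies on martingale theory.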

Proof. If $Q(B) > 0$, then there exists a set $C \subseteq \Omega$ such that $P_\theta(B) > 0$
for all $\theta \in C$ and $\pi_2(C) > 0$. It follows that $\pi_1(C) > 0$ and then that
$P(B) > 0$. □

The importance of Lemma 7.119 is that it applies to the popular model
in which data are conditionally IID given some parameter $\Theta$ with a distribution
in a parametric family. Lemma 7.119 says that if two Bayesians
agree on the parametric family but disagree on the prior distribution, then
Theorem 7.120 will apply to them, so long as one of the prior distributions
is absolutely continuous with respect to the other.
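A small numerical illustration of two Bayesians who share a parametric family but not a prior may help here. The sketch below is an assumption-laden toy, not part of the text: it uses conditionally IID Bernoulli data with Beta(1, 1) and Beta(5, 1) priors (each absolutely continuous with respect to the other) and a fixed alternating data sequence. It tracks only the gap between the two one-step-ahead predictive probabilities, which is a lower bound on the distance between the full predictive distributions of the future, so it illustrates the direction of the result rather than verifying it.

```python
def predictive_prob_one(a, b, successes, n):
    """Posterior predictive probability that the next Bernoulli observation
    is 1 under a Beta(a, b) prior after `successes` ones in n trials."""
    return (a + successes) / (a + b + n)

def one_step_gap(prior1, prior2, data):
    """Absolute difference between the two one-step-ahead predictive
    probabilities; a lower bound on the sup-distance over all future events."""
    s, n = sum(data), len(data)
    return abs(predictive_prob_one(*prior1, s, n) - predictive_prob_one(*prior2, s, n))

data = [1, 0] * 500                      # hypothetical observed sequence
prior_a, prior_b = (1.0, 1.0), (5.0, 1.0)
gaps = [one_step_gap(prior_a, prior_b, data[:n]) for n in (10, 100, 1000)]
print(gaps)
```

The gaps shrink roughly like 1/n in this toy case as the shared data swamp the disagreement between the priors.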

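Theorem 7.120. If $Q \ll P$, then for each $P^n$ there exists a version of $Q^n$
such that

$$Q\left(\left\{(x_1, x_2, \ldots) : \lim_{n \to \infty} \rho\left(P^n(\cdot \mid x_1, \ldots, x_n),\, Q^n(\cdot \mid x_1, \ldots, x_n)\right) = 0\right\}\right) = 1.$$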



The proof of this theorem requires some lemmas and corollaries. 

Lemma 7.121. Let $Q$ be a probability measure, and let $E$ denote expectation
with respect to $Q$. Let $\{Y_n\}_{n=1}^\infty$ be a sequence of random variables such
that $\lim_{n \to \infty} Y_n = Y$ a.s. $[Q]$ and $|Y_n| \le m$ for all $n$ and some nonnegative
$m$. Let $\{\mathcal{U}_j\}_{j=1}^\infty$ be an increasing sequence of σ-fields. Let $\mathcal{U}$ be the smallest
σ-field containing all of the $\mathcal{U}_j$. Then

$$\lim_{j \to \infty,\, n \to \infty} E(Y_n \mid \mathcal{U}_j) = E(Y \mid \mathcal{U}).$$

Proof. Let $G_k = \sup_{n \ge k} Y_n$. For fixed $k$ and $n \ge k$, $Y_n \le G_k$ and
$E(Y_n \mid \mathcal{U}_i) \le E(G_k \mid \mathcal{U}_i)$ a.s. for each $i$. Define

$$Z = \lim_{j \to \infty} \sup_{i \ge j,\, n \ge j} E(Y_n \mid \mathcal{U}_i).$$

Then

$$Z \le \lim_{j \to \infty} \sup_{i \ge j} E(G_k \mid \mathcal{U}_i) = \lim_{i \to \infty} E(G_k \mid \mathcal{U}_i) = E(G_k \mid \mathcal{U}) \to E(Y \mid \mathcal{U}),$$

as $k \to \infty$. The first equality holds because the supremum decreases as $j$ increases.
The limit follows from the martingale convergence theorem B.117.
Similarly, we can show that

$$\lim_{j \to \infty} \inf_{i \ge j,\, n \ge j} E(Y_n \mid \mathcal{U}_i) \ge E(Y \mid \mathcal{U}).$$

Together these imply the lemma. □
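Corollary 7.122. If the probability is 1 that only finitely many of $\{E_n\}_{n=1}^\infty$
occur, then

$$\lim_{n \to \infty,\, j \to \infty} Q\left(\bigcup_{k=n}^\infty E_k \,\Big|\, \mathcal{U}_j\right) = 0, \quad \text{a.s.}$$

Proof. The condition in the statement of the theorem is $Q\left(\bigcup_{n=1}^\infty \bigcap_{k=n}^\infty E_k^C\right) = 1$.
This is equivalent to $Q\left(\bigcap_{n=1}^\infty \bigcup_{k=n}^\infty E_k\right) = 0$. Let $Y_n$ be the
indicator of the event $\bigcup_{k=n}^\infty E_k$. Then $Y = \lim_{n \to \infty} Y_n$ is the indicator of
$\bigcap_{n=1}^\infty \bigcup_{k=n}^\infty E_k$, and $E(Y \mid \mathcal{U}) = 0$. Now apply Lemma 7.121. □

Corollary 7.123. If $\lim_{n \to \infty} T_n = 0$ a.s., then for each $\epsilon > 0$,

$$\lim_{n \to \infty,\, j \to \infty} Q\left(\sup_{k \ge n} |T_k| > \epsilon \,\Big|\, \mathcal{U}_j\right) = 0, \quad \text{a.s.}$$

Proof. Let $E_k = \{|T_k| > \epsilon\}$, so that $\{\sup_{k \ge n} |T_k| > \epsilon\} = \bigcup_{k=n}^\infty E_k$. Now
apply Corollary 7.122. □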

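Lemma 7.124. Let $Q \ll P$ and let $q = dQ/dP$. Define

$$q_n(x_1, \ldots, x_n) = \int q(x_1, x_2, \ldots)\,dP^n(x_{n+1}, \ldots \mid x_1, \ldots, x_n),$$

$$d_n(x_1, x_2, \ldots) = \begin{cases} \dfrac{q(x_1, x_2, \ldots)}{q_n(x_1, \ldots, x_n)} & \text{if } q_n > 0, \\ 1 & \text{if } q_n = 0. \end{cases}$$

Then, $q_n = dQ_n/dP_n$, and for each $\epsilon > 0$,

$$\lim_{n \to \infty} Q^n(|d_n - 1| > \epsilon \mid x_1, \ldots, x_n) = 0, \quad \text{a.s. } [Q].$$

Proof. For every $B \in \mathcal{C}_n$,

$$\int_B q_n(x_1, \ldots, x_n)\,dP_n(x_1, \ldots, x_n) = \int_B \int q(x_1, x_2, \ldots)\,dP^n(x_{n+1}, \ldots \mid x_1, \ldots, x_n)\,dP_n(x_1, \ldots, x_n)$$

$$= \int_{B \times \mathcal{Y}^n} q(x_1, x_2, \ldots)\,dP(x_1, x_2, \ldots) = \int_{B \times \mathcal{Y}^n} dQ(x_1, x_2, \ldots),$$

which equals $Q_n(B)$. Hence, $q_n = dQ_n/dP_n$.

Under probability $P$, $E[q(X_1, X_2, \ldots) \mid x_1, \ldots, x_n] = q_n(x_1, \ldots, x_n)$, so
$\{q_n(X_1, \ldots, X_n)\}_{n=1}^\infty$ is a martingale. By Part I of Lévy's theorem B.118,
we conclude $q_n \to q$ a.s. $[P]$. Since $q_n > 0$ a.s. $[Q_n]$, this implies that $d_n \to 1$
a.s. $[Q]$. Now apply Corollary 7.123 with $\mathcal{U}_j = \mathcal{C}_j$. □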
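Proof of Theorem 7.120. For convenience, let $u$ denote $(x_1, \ldots, x_n)$
and let $v$ denote $(x_{n+1}, \ldots)$. For each $P^n$, let

$$Q^n(C \mid u) = \int_C d_n(u, v)\,dP^n(v \mid u),$$

which is a version of the conditional distribution of $Q$ given $\mathcal{C}_n$ since, for
each $A \in \mathcal{C}_n$ and $C \in \mathcal{C}^n$,

$$\int_A Q^n(C \mid u)\,dQ_n(u) = \int_A \int_C \frac{q(u, v)}{q_n(u)}\,dP^n(v \mid u)\,q_n(u)\,dP_n(u) = \int_A \int_C q(u, v)\,dP(u, v) = Q(A \times C).$$

For each $u$ and $\epsilon > 0$, let

$$A(u) = \{v : d_n(u, v) > 1\}, \qquad A(u, \epsilon) = \{v : d_n(u, v) > 1 + \epsilon\}.$$

Then

$$\rho(P^n(\cdot \mid u), Q^n(\cdot \mid u)) = \int_{A(u)} (d_n(u, v) - 1)\,dP^n(v \mid u) \le \epsilon + \int_{A(u, \epsilon)} [d_n(u, v) - 1]\,dP^n(v \mid u) \le \epsilon + \int_{A(u, \epsilon)} d_n(u, v)\,dP^n(v \mid u) = \epsilon + Q^n(A(u, \epsilon) \mid u).$$

Now, write $\{\lim_{n \to \infty} \rho(P^n(\cdot \mid u_n), Q^n(\cdot \mid u_n)) = 0\}$ as

$$\bigcap_{\epsilon > 0} \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty \{\rho(P^n(\cdot \mid u_n), Q^n(\cdot \mid u_n)) < 2\epsilon\} \supseteq \bigcap_{\epsilon > 0} \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty \{Q^n(A(u_n, \epsilon) \mid u_n) < \epsilon\}$$

$$\supseteq \bigcap_{\epsilon > 0} \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty \{Q^n(|d_n - 1| > \epsilon \mid u_n) < \epsilon\} = \bigcap_{\epsilon > 0} \left\{\lim_{n \to \infty} Q^n(|d_n - 1| > \epsilon \mid u_n) = 0\right\}.$$

The first containment is what we just proved, and the second is trivial.
Lemma 7.124 says that $Q$ of the last of these sets is 1, hence $Q$ of the first
of these sets is 1. □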



7.5 Large Sample Tests 
7.5.1 Likelihood Ratio Tests 

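Asymptotic theory can provide approximate tests in complicated situations.
Let $\Omega \subseteq \mathbb{R}^p$, and assume that $\Omega_H = \{\theta : g(\theta) = c\}$. Reparameterize, if
necessary, so that the first $k$ coordinates of $\theta$ are $g(\theta)$, for $k \le p$. Let $\hat\Theta_{n,H}$
be the MLE of $\Theta$ assuming $\Omega_H$ is the parameter space, and let $\hat\Theta_n$ be the
unrestricted MLE. Then the likelihood ratio (LR) criterion (as introduced
in Section 4.5.5) is

$$L_n = \frac{\sup_{\theta \in \Omega_H} f_{X|\Theta}(X \mid \theta)}{\sup_{\theta \in \Omega} f_{X|\Theta}(X \mid \theta)} = \frac{f_{X|\Theta}(X \mid \hat\Theta_{n,H})}{f_{X|\Theta}(X \mid \hat\Theta_n)}.$$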


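We will first consider the special case in which $p = k = 1$ and $g(\theta) = \theta$.
Then

$$-2\log L_n = -2\log f_{X|\Theta}(X \mid c) + 2\log f_{X|\Theta}(X \mid \hat\Theta_n) = -2\ell_n(c) + 2\ell_n(\hat\Theta_n).$$

Suppose that $\ell_n$ has two continuous derivatives. Then

$$\ell_n(c) = \ell_n(\hat\Theta_n) + (c - \hat\Theta_n)\ell_n'(\hat\Theta_n) + \tfrac{1}{2}(c - \hat\Theta_n)^2 \ell_n''(\theta_n^*),$$

where $\theta_n^*$ is between $c$ and $\hat\Theta_n$. When the MLE is an interior maximum,
$\ell_n'(\hat\Theta_n) = 0$, so $-2\log L_n = -(c - \hat\Theta_n)^2 \ell_n''(\theta_n^*)$, and the asymptotic
normality of the MLE then gives $-2\log L_n \xrightarrow{D} \chi^2_1$
under $P_c$, that is, under $H$.

For the more general (higher-dimensional) cases, we have the following
theorem.

Theorem 7.125. Assume the conditions of Theorem 7.63. Let $L_n$ be the
LR criterion for testing $H : \Theta_i = c_i$ for $i = 1, \ldots, k$. Then $-2\log L_n \xrightarrow{D} \chi^2_k$
under $H$.

Proof. Let $c^\top = (c_1, \ldots, c_k)$, and let $\theta_0 \in \Omega_H$ be of the form $\theta_0^\top =
(c^\top, \psi_0^\top)$, where $\psi_0$ has dimension $p - k$. Under $H$, the parameter is $\Psi^\top =
(\Theta_{k+1}, \ldots, \Theta_p)$, and the conditions of Theorem 7.63 hold in this smaller
problem.

We will find the asymptotic distribution of $-2\log L_n$ under $P_{\theta_0}$ and see
that it does not depend on $\psi_0$. Let $\hat\Theta_{n,H}$ be the MLE assuming that $\Omega_H$ is
the parameter space. Then
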

We will also write the overall MLE in partitioned form as 




Then 







Use Taylor's theorem C.l to write 



<«(C) =M *" )+ [( c)-* 



ddi ^n(©n) 



1 

+ 2 



(7.126) 

with 0* coordinatewise between 6 n and 6 n) #. Next, we use Taylor's theo- 
rem C.l to expand the gradient vector of £ n at both 0 n and 0 n ,/f around 
#o- The vector of partial derivatives around G n is the p-dimensional vector 



/' : \ 



V 



+ ((a& a4) )) ( ^^ o) ' (7 - 12?) 



where d\ is coordinatewise between 6q and 6 n . The vector of partial deriva- 
tives around O n ,H is the (p — A:)-dimensional vector 



/ 



\ 



+ ({^ e ^))^"-^ (7 - 128) 



where 0$ is coordinatewise between 0 O and 9 n ,H- It follows from (7.64) 
that 

(fcfc)-K(^)) 4 -*-*>-Ua> 

Equating the last p - k coordinates of the two 0 vectors in (7.127) and 
(7.128), we get 

DnM*n,H ~ rM = K{C - C) + D„(* n - V>o). 

We know from Theorem 7.63 that 4> n<H - i>o = O p {l/y/n), - ipo = 
Op{l/Vn), and c - c = 0 P (l/y/n). Also, we just proved that D n -D 0 = 
o P (l), Bl - B 0 = o P (l), and D n ,H - A> = o P (l). It follows that 



D Q (*n,H ~ 4>0) = Bjic - C) + D 0 (*n ~ iM + *P 






Hence, 



*n,H = *n + Dq 1 BJ(C - c) + 0 P 




(7.129) 



Now, combine (7.126) and (7.129) to conclude that 



@n ( ,f f I ^n(©n) 



" -2[D?Bj{t-c)\ Ix ^ o) [D^Bj(c-c) 
= J(c - c) T [A 0 - B 0 D» l Bj}(c - c) + o P (l). 



+ o P (l) 



The matrix $A_0 - B_0 D_0^{-1} B_0^\top$ is the negative of the inverse of the upper-left
$k \times k$ corner of $\mathcal{I}_{X_1}(\theta_0)^{-1}$, which, in turn, is minus the inverse of the
asymptotic covariance matrix of $\hat{c}$. Since



it follows that $-2\log L_n \xrightarrow{D} \chi^2_k$. Note that the choice of $\psi_0$ is irrelevant. □

When appealing to the asymptotic distribution of the LR criterion, the
tradition is to choose $\alpha$ and reject $H$ if $-2\log L_n$ is greater than the $1 - \alpha$
quantile of the $\chi^2_k$ distribution.

Example 7.130 (Continuation of Example 7.104; see page 444). Using the same
data as in the previous Cauchy example, suppose that we wish to test $H : \Theta = 5$.
The two values of the likelihood function are

$$\ell_{10}(4.531) = -10\log(\pi) - 27.36 \quad \text{and} \quad \ell_{10}(5) = -10\log(\pi) - 27.50.$$

So $-2\log L_n = 0.28$, which is too small to reject $H$ at any popular level.
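The arithmetic in this example can be reproduced in a few lines. The sketch below takes the two log-likelihood values quoted above as given, forms $-2\log L_n$, and converts it to an approximate p-value using the chi-squared distribution with one degree of freedom, whose upper tail can be written with the complementary error function. The helper name `chi2_1_sf` is an illustrative assumption.

```python
import math

def chi2_1_sf(x):
    """Upper tail probability of a chi-squared distribution with 1 degree of
    freedom: P(Z**2 > x) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

# Log-likelihood values quoted in Example 7.130.
loglik_mle = -10.0 * math.log(math.pi) - 27.36    # at the MLE, 4.531
loglik_null = -10.0 * math.log(math.pi) - 27.50   # at the hypothesized value, 5

lr_statistic = -2.0 * (loglik_null - loglik_mle)
p_value = chi2_1_sf(lr_statistic)
print(lr_statistic, p_value)
```

The approximate p-value is close to 0.6, far above conventional levels such as 0.05 or 0.01.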

7.5.2 Chi-Squared Goodness of Fit Tests 

Another large sample test is the chi-squared ($\chi^2$) goodness of fit test motivated
as asymptotically UMP invariant. If $\Omega$ is the set of all distributions
and $\theta_0$ is one element of $\Omega$, then we can test $H : \Theta = \theta_0$ asymptotically as
follows. Choose a fixed dimension $p$, and divide $\mathcal{X}$ into $p$ disjoint regions
$R_1, \ldots, R_p$. Let $Q_i = P_\Theta(R_i)$ and $q_{i,0} = P_{\theta_0}(R_i)$. We replace $H : \Theta = \theta_0$
by $H^* : Q = q_0$. $H$ implies $H^*$, but $\Omega_{H^*}$ is bigger than $\Omega_H$. The general
result is the following.

Theorem 7.131. Suppose that $\{X_n\}_{n=1}^\infty$ are IID with distribution $P$. Let
$(R_1, \ldots, R_p)$ be a partition of $\mathcal{X}$. Define $Y_j$ for $j = 1, \ldots, p$ to be the number
of the first $n$ $X_i$ that are in $R_j$, and define $q_i = P(R_i)$. If

$$C_n = \sum_{i=1}^p \frac{(Y_i - nq_i)^2}{nq_i},$$

then $C_n \xrightarrow{D} \chi^2_{p-1}$ as $n \to \infty$.

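Proof. The distribution of $Y = (Y_1, \ldots, Y_p)^\top$ is $\mathrm{Mult}(n; q_1, \ldots, q_p)$ and
we know that

$$\frac{1}{\sqrt{n}} \begin{pmatrix} Y_1 - nq_1 \\ \vdots \\ Y_p - nq_p \end{pmatrix} \xrightarrow{D} N_p(0, \Sigma),$$

where $\Sigma = ((\sigma_{ij}))$, with

$$\sigma_{ij} = \begin{cases} -q_i q_j & \text{if } i \ne j, \\ q_i(1 - q_i) & \text{if } i = j. \end{cases}$$

Let $\Sigma^*$ be the upper-left $(p-1) \times (p-1)$ corner of $\Sigma$. Define

$$D_n = \frac{1}{n} (Y_1 - nq_1, \ldots, Y_{p-1} - nq_{p-1})\, \Sigma^{*-1} \begin{pmatrix} Y_1 - nq_1 \\ \vdots \\ Y_{p-1} - nq_{p-1} \end{pmatrix},$$

and note that $D_n \xrightarrow{D} \chi^2_{p-1}$. We can rewrite $D_n$ by using

$$\Sigma^* = \begin{pmatrix} q_1 & 0 & \cdots & 0 \\ 0 & q_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & q_{p-1} \end{pmatrix} - \begin{pmatrix} q_1 \\ \vdots \\ q_{p-1} \end{pmatrix} (q_1, \ldots, q_{p-1}).$$

The inverse of a matrix of the form $A - bb^\top$ is

$$A^{-1} + \frac{A^{-1} b b^\top A^{-1}}{1 - b^\top A^{-1} b}.$$

This means that $D_n$ can be written as

$$\sum_{i=1}^{p-1} \frac{(Y_i - nq_i)^2}{nq_i} + \frac{\left[\sum_{i=1}^{p-1} (Y_i - nq_i)\right]^2}{nq_p} = \sum_{i=1}^p \frac{(Y_i - nq_i)^2}{nq_i} = C_n. \qquad \Box$$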

The traditional $\chi^2$ goodness of fit test is to reject the hypothesis that the
distribution of the data is $P$ if $C_n$ is greater than the $1 - \alpha$ quantile of the
$\chi^2_{p-1}$ distribution.

Example 7.132. Bortkiewicz (1898) reports data on the number of men killed
by horsekick in the Prussian army. 32 The data were collected from 14 army units
for 20 years.

    Number killed     0     1     2     3     4    ≥ 5
    Count           144    91    32    11     2     0

These data are clearly not uniformly distributed over the six categories, but we
illustrate the $\chi^2$ test with each $q_i = 1/6$. The value of $C_{280}$ is 366.4, which far
exceeds the 0.9999 quantile of the $\chi^2_5$ distribution.

32 See Bishop, Fienberg, and Holland (1975) for a more complete analysis of
this data.
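The value 366.4 can be checked directly from the six counts. The sketch below computes the goodness of fit statistic, the sum over categories of (observed minus expected) squared divided by expected, for the horsekick counts under equal category probabilities 1/6; the function name is an illustrative assumption.

```python
def pearson_statistic(counts, probs):
    """Chi-squared goodness of fit statistic: sum of (Y_i - n q_i)^2 / (n q_i)."""
    n = sum(counts)
    return sum((y - n * q) ** 2 / (n * q) for y, q in zip(counts, probs))

# Horsekick counts for 0, 1, 2, 3, 4, and 5 or more deaths (14 units x 20 years).
counts = [144, 91, 32, 11, 2, 0]
uniform_probs = [1.0 / 6.0] * 6

c_uniform = pearson_statistic(counts, uniform_probs)
degrees_of_freedom = len(counts) - 1
print(sum(counts), round(c_uniform, 1), degrees_of_freedom)
```

With six categories and a fully specified hypothesis, the reference distribution has 6 − 1 = 5 degrees of freedom.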

A possible Bayesian approach to this problem is to try to measure how
close the distribution $\mathbf{P}$ is to $P$. Of course there are many measures of closeness.
We could let $Q = (Q_1, \ldots, Q_p)$, where $Q_i = \mathbf{P}(R_i)$, and find a large
sample approximation to the posterior distribution of $Q$ based on just the
data $(Y_1, \ldots, Y_p)$. Theorems 7.102 and 7.101 give one such approximation
as

$$Q \sim N_p\left(\left(\frac{y_1}{n}, \ldots, \frac{y_p}{n}\right)^\top, S\right),$$

where $S = ((s_{i,j}))$, with

$$s_{i,j} = \begin{cases} -\dfrac{y_i y_j}{n^3} & \text{if } i \ne j, \\ \dfrac{y_i(n - y_i)}{n^3} & \text{if } i = j. \end{cases}$$

We could then examine the distribution of $Q - q$, where $q = (q_1, \ldots, q_p)$, or
specifically of $\|Q - q\|$ or whatever. For example, let $S^*$ be the upper-left
$(p-1) \times (p-1)$ corner of $S$ and consider

$$(Q_1 - q_1, \ldots, Q_{p-1} - q_{p-1})\, S^{*-1} \begin{pmatrix} Q_1 - q_1 \\ \vdots \\ Q_{p-1} - q_{p-1} \end{pmatrix}.$$

This quantity would have approximately an $NC\chi^2_{p-1}\left(\sum_{i=1}^p (y_i - nq_i)^2 / y_i\right)$
distribution.

A different type of hypothesis might be $H : P \in \mathcal{P}_0$, where $\mathcal{P}_0$ is a
parametric family with $k$-dimensional parameter space $\Gamma$ with $k < p - 1$.
This case was considered by Fisher (1924).

Theorem 7.133. Let $\Gamma$ be a $k$-dimensional parameter space with parameter
$\Psi$ and $k < p - 1$. Let $R_1, \ldots, R_p$ be a partition of $\mathcal{X}$. Let $Y_i$ be the number
of observations in $R_i$ for $i = 1, \ldots, p$. Call $Y = (Y_1, \ldots, Y_p)$ the reduced
data. Let $S_\psi$ for $\psi \in \Gamma$ stand for the conditional distribution of the reduced
data given $\Psi = \psi$. Define $q_i(\psi) = S_\psi(R_i)$ and $q(\psi) = (q_1(\psi), \ldots, q_p(\psi))$.
Assume that $q$ has at least two derivatives and is one-to-one. Let $\hat\Psi_n$ be
the MLE based on the reduced data, and let $\mathcal{I}_{X_1}(\psi)$ be the Fisher information
matrix. Assume that $\hat\Psi_n$ is asymptotically normal $N_k(\psi, \mathcal{I}_{X_1}(\psi)^{-1}/n)$.
Define $\hat{q}_{i,n} = q_i(\hat\Psi_n)$ and

$$C_n = \sum_{i=1}^p \frac{(Y_i - n\hat{q}_{i,n})^2}{n\hat{q}_{i,n}}.$$

Then $C_n \xrightarrow{D} \chi^2_{p-k-1}$ as $n \to \infty$.

Proof. The likelihood function for the reduced data is ((tp) = f]<=i qTW- 
Setting the partial derivatives of the log of the likelihood equal to 0 gives 
the equations 



for j = 1, . . . , fc. Since ^ n is v^-consistent and q is continuous, it follows 
that q i)n is a y^-consistent estimator of Since Yi/[nqi, n ] 1 for each 

z, it follows from the likelihood equations (and Problem 7 on page 468) that 



^ n 2 «?(*„) 

The argument we just finished for the case in which the hypothesis is 
simple shows that for every tp e I\ 



C W) = 2^ 7~7\ * Xp-v 

£f nq%Wt 



under S^. Then 



Use the delta method to write 

A: ^ 



It follows that 

1 ±--zn^h* j 



2 




* j — 1 t — 1 
So, we can write 

E X>> - - *».*)ap^* ( * n) f + 0p(1) ' 

We can rearrange the sum of the first set of terms inside the large brace 
and use (7.134) to remove these terms from the sum. Then 

c W - c n = - ± £ j ±. ^ _ * nj) ^. ft( * B) j 

Since = 0p(n) and the inner summations are both Op(l/n), we can 

p 

use the fact that li/[n^ jn ] — ► 1 for every i (and Problem 7 on page 468) 
to rewrite C(^) — C n as 

1 a 2 



9i(*n) > + Op(l)- 



Next, notice that 



"^4 9i(4 ' n) 4 9i(#n) } 4 -^ ( ^ t ' 

the (j, £) element Zxi(^)- Combining this with the previous equation, we 
get 

C(V) - C n = nty - 4> n ) T l Xl (iP)(iP - * n ) + o P (l). 






Since $\mathcal{I}_{X_1}(\psi)^{-1}/n$ is the asymptotic covariance matrix of $\hat\Psi_n$, we have that
$C(\psi) - C_n \xrightarrow{D} \chi^2_k$. Next we prove that $C(\psi) - C_n$ is asymptotically independent
of $C_n$. This will make the asymptotic characteristic function of $C_n$
equal to the ratio of the asymptotic characteristic functions of $C(\psi)$ and
$C(\psi) - C_n$. Since the former is that of $\chi^2_{p-1}$ and the latter is that of $\chi^2_k$,
the ratio is that of $\chi^2_{p-k-1}$.

Define q n = (<?i, n > • • • <?p,n) T • Use the delta method to write 

yfctin ~ <?(</>)) = V$n -</>] + o P (l), 

where V = ((vij)) is the p x k matrix, with Vij = dqi(ip)/dil>j. It follows 
that y/n(q n -qfy)) is asymptotically AT p (0, Vlx 1 (ip)~ l V T ). Since q is one- 
to-one, V has rank A; and V~ = (V T V)~ 1 V T exists. It is easy to see that 
V~V is a k x k identity matrix. Hence, 

= "(<7n " q) T V- T l Xl {&)V-(q n -q) + o P (l). 

Also, since (Yi - nq i}Tl ) 2 /n = (9p(l) for each i and l/&, n 1/0* WO > we 
can use Problem 7 on page 468 to conclude 

^ y> - ng i>w ) 2 m 

C n = > 7T\ "I" Op(l). 



t=l 



The proof will be complete if we can show that Y - q n and q n - q(ij)) are 
asymptotically independent. Since they are jointly asymptotically multi- 
variate normal, it suffices to show that they are asymptotically uncorre- 
cted. Since Y - q(i/>) = Y -q n + (q n - <?(</>)), we need only show that the 
asymptotic covariance matrix of q n (namely, VIx 1 (^)V' T ) is the same as 
the asymptotic covariance between Y and q n . We find the latter as follows. 
First, note that (following some tedious algebra) 



E 



Hence the asymptotic covariance between Y and the vector D(tp) of partial 
derivatives of \og£(ip) is V. Next, use the delta method to write 



_L ry. 



where 



1 d 2 

n dipsdipt 



m n , a ,t = - — Q .,. log/W- 






Set M n = ((m n>Sj t)), and note that M n — » Jx^^). It follows that 
~Z)ty>) = VST^ W(*n -4>) + o P (l). 

Hence, y/n(4> n -i/j) = X Xl (^)" l D(^)/n-hop{l). So, the asymptotic covari- 
ance between F and # n is VX Xl WO" 1 . Since g n - = o P (l), it follows 
that the asymptotic covariance between Y and q n is V r Xx 1 (V0'" 1 V rT , and 
the proof is complete. □ 
In applying Theorem 7.133, one must be careful to calculate the MLE of
$\Psi$ based on the reduced data $Y$, not on the original data $X$.

Example 7.135 (Continuation of Example 7.132; see page 462). A more reasonable
hypothesis to test in the horsekick data example is that the distribution
of horsekicks is a member of the Poisson family. Because there are no data in the
"≥ 5" category, the likelihood function for the reduced data is the same as for
the original data, and the MLE is the sample average $\hat\Psi_{280} = 0.7$. The six values
of $n\hat{q}_{i,n}$ corresponding to $i = 0, 1, 2, 3, 4, \ge 5$ are (139.0, 97.3, 34.1, 7.9, 1.4, 0.2),
respectively, and $C_{280} = 2.346$, which is the 0.3276 quantile of the $\chi^2_4$ distribution.
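The expected counts in this example follow from the Poisson probabilities at the estimated mean. The sketch below recomputes them from the six counts, lumping everything from 5 upward into the last category, and evaluates the chi-squared distribution function with 4 degrees of freedom (6 categories, 1 estimated parameter) through its closed form $1 - e^{-x/2}(1 + x/2)$. Variable and function names are illustrative assumptions. The statistic recomputed from unrounded expected counts need not agree with the reported 2.346 to every digit.

```python
import math

counts = [144, 91, 32, 11, 2, 0]          # deaths 0, 1, 2, 3, 4, and 5 or more
n = sum(counts)
mean_hat = sum(i * y for i, y in enumerate(counts)) / n   # sample average

# Poisson probabilities for 0..4, with the remaining mass in the last category.
probs = [math.exp(-mean_hat) * mean_hat ** i / math.factorial(i) for i in range(5)]
probs.append(1.0 - sum(probs))
expected = [n * q for q in probs]

statistic = sum((y - e) ** 2 / e for y, e in zip(counts, expected))

def chi2_4_cdf(x):
    """Distribution function of a chi-squared variable with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)

print(mean_hat, [round(e, 1) for e in expected], statistic, chi2_4_cdf(2.346))
```

The rounded expected counts reproduce the six values listed in the example, and the distribution function at 2.346 reproduces the stated quantile level of about 0.3276.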

Example 7.136. Let $\{X_n\}_{n=1}^\infty$ be exchangeable. Suppose that we want to test
the hypothesis that they have normal distribution. Let $R_1 = (-\infty, r_1]$, $R_i =
(r_{i-1}, r_i]$ for $i = 2, \ldots, p - 1$, and $R_p = (r_{p-1}, \infty)$. (For convenience, define
$r_0 = -\infty$ and $r_p = \infty$.) Then $q_i(\psi) = \Phi([r_i - \mu]/\sigma) - \Phi([r_{i-1} - \mu]/\sigma)$ if $\psi =
(\mu, \sigma)$. The likelihood function to maximize is $\ell(\psi) = \prod_{i=1}^p q_i(\psi)^{Y_i}$, where $Y_i =
\sum_{j=1}^n I_{R_i}(X_j)$. The MLE will not equal the sample average and sample standard
deviation in general.

Example 7.137. The usual $\chi^2$ test of independence in a two-way
($r \times c$) contingency table is an example of Theorem 7.133. In this case,
the data are not reduced, that is, each $R_i$ contains only one element of
$\mathcal{X}$. In fact, the $R_i$ are the cells themselves, which would be better
denoted $R_{i,j}$ for the cell in row $i$ and column $j$ ($i = 1, \ldots, r$,
$j = 1, \ldots, c$). The parameter $\theta$ consists of two marginal probability
vectors, one for the rows, $\theta^R$, and one for the columns, $\theta^C$. Then
$q_{i,j}(\theta) = \theta^R_i \theta^C_j$. The MLE $\hat\Theta_n$ is easily seen
to consist of $\hat\Theta^R_i$ equal to the row $i$ total divided by $n$ and
$\hat\Theta^C_j$ equal to the column $j$ total divided by $n$. One easily sees
that the test statistic is the usual $\chi^2$ statistic, with the appropriate
degrees of freedom $rc - 1 - (r-1) - (c-1) = (r-1)(c-1)$.
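A minimal sketch of the computation in Example 7.137 follows. The fitted cell probability is the product of the row and column proportions, so the fitted count is (row total)(column total)/n. There are rc - 1 free cell probabilities and (r - 1) + (c - 1) estimated parameters, which leaves (r - 1)(c - 1) degrees of freedom. The table used here is made up for illustration.

```python
def independence_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    stat = 0.0
    for i in range(r):
        for j in range(c):
            fitted = row_tot[i] * col_tot[j] / n  # n * (row_i / n) * (col_j / n)
            stat += (table[i][j] - fitted) ** 2 / fitted
    return stat, (r - 1) * (c - 1)

stat, df = independence_chi2([[10, 20], [20, 10]])  # hypothetical 2 x 2 table
```

For this table every fitted count is 15, so each of the four cells contributes 25/15 to the statistic.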



7.6 Problems 

Section 7.1: 

1. Prove Proposition 7.4 on page 396. 

2. Prove Proposition 7.14 on page 398. 

3. Let $X$ and $\{X_n\}_{n=1}^\infty$ be random variables, and suppose that
$X_n \stackrel{\mathcal{D}}{\to} X$. Prove that $X_n = O_P(1)$.



468 Chapter 7. Large Sample Theory 

4. Let the conditional distribution of $\{X_n\}_{n=1}^\infty$ be that of IID
$N(\mu, \sigma^2)$ random variables given $\Theta = (\mu, \sigma)$. Define



Find the asymptotic distribution of $\bar{X}_n + 1.96 S_n$. That is, find $a_n$
and $b_n$ so that $a_n(\bar{X}_n + 1.96 S_n - b_n)$ converges in distribution to
a nondegenerate distribution.

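5. Suppose that for each $\theta \in \Omega$,
$(\mathcal{X}_n, \mathcal{B}_n, P_{\theta,n})$ is a sequence of probability
spaces. Let $Y_n : \mathcal{X}_n \to \mathbb{R}^k$ be random vectors for each
$n$. Suppose that $Y_n = o_P(1)$ for each $\theta$. Let $\tau$ be a sigma field
of subsets of $\Omega$ such that for every $n$ and every $A \in \mathcal{B}_n$,
$P_{\theta,n}(A)$ is $\tau$ measurable as a function of $\theta$. Let $Q$ be a
probability measure over $(\Omega, \tau)$. Define
$Q_n(\cdot) = \int_\Omega P_{\theta,n}(\cdot)\,dQ(\theta)$ for each $n$ so that
$(\mathcal{X}_n, \mathcal{B}_n, Q_n)$ is a probability space. Show that
$Y_n = o_P(1)$ with respect to the sequence $\{Q_n\}_{n=1}^\infty$.

6. Consider the setup in Problem 5 above. Let $\mathcal{X}_n = \mathcal{X}$ for
all $n$. If $P_{\theta,n} \stackrel{W}{\to} P_\theta$ as $n \to \infty$ for each
$\theta$, let $Q_0(\cdot) = \int_\Omega P_\theta(\cdot)\,dQ(\theta)$. Prove that
$Q_n \stackrel{W}{\to} Q_0$.

7. Suppose that $Z_{i,n} \stackrel{P}{\to} c_i$ and $X_{i,n} = O_P(1)$ for
$i = 1, \ldots, p$ as $n \to \infty$. Show that
$\sum_{i=1}^p Z_{i,n} X_{i,n} - \sum_{i=1}^p c_i X_{i,n} = o_P(1)$.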

8. Suppose that $\rho(t) \geq 0$ is an even function of $t$ with $\rho(t) > 0$
for $t \neq 0$ and $\rho(t)$ a strictly increasing function of $|t|$. If

$$\lim_{n\to\infty} E_\theta\, \rho(X_n - g(\theta)) = 0$$

for all $\theta$, then show that $X_n$ is consistent for $g(\theta)$.

9.*Suppose that $\sqrt{n}(Y_n - \mu) \stackrel{\mathcal{D}}{\to} N_{k+1}(0, \Sigma)$.
Let $U_n$ be the smallest root of the polynomial
$p_n(u) = \sum_{i=0}^k Y_{n,i} u^i$, where $Y_n = (Y_{n,0}, \ldots, Y_{n,k})$.
Let $u_0$ be the smallest root of $p(u) = \sum_{i=0}^k \mu_i u^i$. Assume that
$u_0$ has multiplicity exactly 3 (i.e., $p(u_0) = p'(u_0) = p''(u_0) = 0$, but
$p'''(u_0) \neq 0$). Find $a_n$ so that $a_n(U_n - u_0)$ converges in
distribution to a nondegenerate distribution, and find the asymptotic
distribution.

10. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with $N(\theta, 1)$
distribution given $\Theta = \theta$. Let
$Y_n = \Phi\left((c - \bar{X}_n)/\sqrt{1 - 1/n}\right)$, where $\Phi$ is the
standard normal distribution function and $c$ is a constant. Find $a_n$ and
$b_n$ such that $a_n(Y_n - b_n)$ has a nondegenerate limiting distribution, and
find the distribution.

11. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID $Ber(\theta)$ given
$\Theta = \theta$. Let $Y_n = n^{-1} \sum_{i=1}^n X_i$ and let
$g(y) = 2\sin^{-1}(\sqrt{y})$ (i.e., $g^{-1}(z) = \sin^2(z/2)$). Suppose the
prior for $\Theta$ has density given by



where $0 < \theta < 1$, $\mu$ and $\sigma$ are constants, and



Let $Z_n = g(Y_n)$. Find $a_n$ and $b_n$ such that $a_n Z_n + b_n$ converges in
distribution to a nondegenerate distribution with respect to the marginal
distribution of the data. (Hint: Recall that
$d\sin^{-1}(u)/du = \{1 - u^2\}^{-1/2}$.)

Section 7.2: 



12. Suppose that $F(t) = 1$ and $F(t - \epsilon) < 1$ for all $\epsilon > 0$ and
that $F$ is differentiable at all values less than $t$ with derivative $f$ such
that $\lim_{x \uparrow t} f(x) = c$ with $0 < c < \infty$, and $F$ is continuous
at $t$. Let $\{X_n\}_{n=1}^\infty$ be IID with CDF $F$ and let
$X_{(n)} = \max\{X_1, \ldots, X_n\}$. Prove that
$n(t - X_{(n)}) \stackrel{\mathcal{D}}{\to} Exp(c)$.

13. Prove Proposition 7.34 on page 408. 

14. Prove Proposition 7.37 on page 410. 

15. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID with Cauchy
distribution having median $\theta$ given $\Theta = \theta$. Calculate the
asymptotic efficiency of the sample median and of the best linear combination of
three symmetrically placed sample quantiles.

16. Let $F(x) = [1 + \exp(-x)]^{-1}$. Assume that $\{X_n\}_{n=1}^\infty$ are
conditionally IID given $\Theta = \theta$ with CDF $F(x - \theta)$.

(a) Prove that the density is symmetric about $\theta$.

(b) If we wish to use the L-estimator based on $Y_p$, $Y_{1/2}$, and $Y_{1-p}$
with $p < 1/2$, find the best $p$ and the best coefficients.

17.*Let $\{X_n\}_{n=1}^\infty$ be conditionally IID given $\Theta = \theta$ with
density equal to

$$f_{X|\Theta}(x|\theta) = \begin{cases}
4\left(x - \theta + \frac{1}{2}\right) & \text{if } \theta - \frac{1}{2} < x \leq \theta, \\
4\left(\frac{1}{2} - x + \theta\right) & \text{if } \theta < x < \theta + \frac{1}{2}.
\end{cases}$$

(a) Find the asymptotic joint distribution of the $p$, $1/2$, and $1-p$ sample
quantiles of a sample of size $n$ as $n \to \infty$.

(b) Find the best linear combination of the three sample quantiles $p$, $1/2$,
and $1-p$ for estimating $\Theta$.

(c) Try to find the best $p$ if one wishes to estimate $\Theta$ using a linear
combination of the three sample quantiles $p$, $1/2$, $1-p$, and show that the
usual analysis fails.

18. In Problem 17 above, find the asymptotic joint distribution of the largest 
and smallest order statistics from a sample of size n. 

19. Let the conditional distribution of $\{X_n\}_{n=1}^\infty$ given
$\Theta = \theta$ be IID $U(0, \theta)$. Let $X_{(k)}$ denote the $k$th order
statistic based on $X_1, \ldots, X_n$. Find $a_n$ and $b_n$ such that
$a_n(X_{(k)} - b_n)$ converges in distribution to a nondegenerate distribution
as $n \to \infty$ for fixed $k$.

Section 7.3: 



20. Return to Problem 16 above.

(a) Find the asymptotic variance of the estimator found in part (b).

(b) Compute the Fisher information $\mathcal{I}_{X_1}(\theta)$ and the
efficiency of the estimator found in part (b) of Problem 16.

(c) Compute the efficiency of $\bar{X}_n = \sum_{i=1}^n X_i/n$ as an estimator
of $\Theta$.

21. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID given $\Theta = \theta$ with
distribution $U(\theta^2, \theta)$, where the parameter space is the interval
$(0, 1)$.

(a) Find the MLE of $\Theta$.

(b) Find a nondegenerate asymptotic distribution for the MLE.

22. Prove that the relative rate of convergence is unique by first showing the
following. Let $a_n, a'_n > 0$ and let $H, H'$ be CDFs. If
$a_n(G_n - g(\theta)) \stackrel{\mathcal{D}}{\to} H$ and
$a'_n(G_n - g(\theta)) \stackrel{\mathcal{D}}{\to} H'$, then
$\lim_{n\to\infty} a'_n/a_n = c \in (0, \infty)$ and $H'(x) = H(x/c)$.

23. Prove the claim at the end of Example 7.46 on page 413 about the rela- 
tive rate of convergence being the square root of the ARE for asymptotic 
variance. 

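24. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with $N(\theta, 1)$
distribution given $\Theta = \theta$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$
and $S_n = \sum_{i=1}^n (X_i - \bar{X}_n)^2$. Let $a_n$ be such that
$\Pr(S_n > a_n) = 1/n$. Let $k_n$ be the largest integer less than or equal to
$\sqrt{n}$. Consider the following two estimators:

$$T_n = \begin{cases} \bar{X}_n & \text{if } S_n \leq a_n, \\ n & \text{if } S_n > a_n, \end{cases}
\qquad U_n = \frac{1}{k_n} \sum_{i=1}^{k_n} X_i.$$

(a) Show that the ARE of $U_n$ to $T_n$ is 0 using the criterion of rate of
convergence from Example 7.46 on page 413.

(b) Show that for any fixed $\epsilon > 0$,

$$P_\theta(|T_n - \theta| > \epsilon) = \frac{1}{n} + o\left(\frac{1}{n}\right),$$

$$P_\theta(|U_n - \theta| > \epsilon) = o\left(\frac{1}{n}\right).$$

Comment on this in light of part (a). (Hint: If $X \sim N(0, 1)$, then
$\Pr(|X| > c) \leq 2\phi(c)/c$.$^{33}$)

(c) What happens if we replace $\epsilon$ by $a/\sqrt{n}$ in part (b)?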

25. Let $\Theta > 0$ be a parameter, and $\{X_n\}_{n=1}^\infty$ be a
conditionally IID sample (given $\Theta = \theta$) with exponential distribution
$Exp(\theta)$, and let $\tilde{X}_n$ be the sample median. Let $\bar{X}_n$ be
the sample average.

(a) Find $a(\theta)$ such that conditional on $\Theta = \theta$,
$\sqrt{n}(\log(2)/\tilde{X}_n - \theta) \stackrel{\mathcal{D}}{\to} N(0, a(\theta))$.

(b) Using the same criteria as in Example 7.44 on page 413, find the ARE of
$\log(2)/\tilde{X}_n$ to $1/\bar{X}_n$ as estimators of $\Theta$.



33 This inequality is equivalent to Mill's ratio. 






26. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with $N(\theta, 1)$
distribution given $\Theta = \theta$. We observe $Y_1, \ldots, Y_n$ where

$$Y_i = \begin{cases} 0 & \text{if } X_i \leq 0, \\ X_i & \text{if } 0 < X_i < 1, \\ 1 & \text{if } X_i \geq 1. \end{cases}$$

(a) Find a minimal sufficient statistic.

(b) Construct two different (i.e., different by more than $o_P(1/\sqrt{n})$)
consistent, asymptotically normal estimates of $\Theta$, and compute their ARE
using the same criterion as in Example 7.44 on page 413.

27. Let the parameter space be two-dimensional. Suppose that {T n }%Li are 
conditionally IID given G = (0i,02) with density /r|e(£|0i» #2). Suppose 
that the conditions of Theorem 7.63 hold. Let 81 be the MLE of the first 
coordinate of 6. Let 61(^2) be the MLE of the first coordinate of G if 
it is assumed that the second coordinate is known to equal 02. Use the 
same criteria as in Example 7.44 on page 413 to find the ARE of these two 
estimators when it is assumed that the second coordinate of G is known to 
equal #2- Express your answer in terms of the Fisher information matrix. 

28. Under the conditions of Theorem 7.48, prove that 



Y[fx l \&(X i \9o) < f[fx l \s(Xi\e), infinitely often 



= 0. 



29. Verify the conditions of Theorem 7.49 in the case in which the observations
are IID given $\Theta = \theta$ with density
$f_{X|\Theta}(x|\theta) = \theta e^{-\theta x}$, for $x > 0$.

30. Assume that $\Omega = \{\theta_1, \ldots, \theta_m\}$ is finite, and let the
prior distribution be $(\pi_1, \ldots, \pi_m)$, with
$\pi_i = \Pr(\Theta = \theta_i)$. Let $\mu$ be a measure such that
$P_\theta \ll \mu$ for each $\theta \in \Omega$, and let

$$f_i(x) = \frac{dP_{\theta_i}}{d\mu}(x).$$

Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID given
$\Theta = \theta_i$ with density $f_i(x)$ for each $i$. Let $\hat\Theta_n$ be
the MLE after $n$ observations. Prove

$$\lim_{n\to\infty} \Pr(\hat\Theta_n = \theta_i) = \pi_i.$$

31. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID given $\Theta = \theta$ with
$N(\theta, 1)$ distribution. Let $\Omega$ be the set of all integers.

(a) Find the MLE $\hat\Theta_n$ of $\Theta$ and prove that it is unbiased.

(b) Show that there exist positive constants $a$ and $b$ such that, for all
sufficiently large $n$, $\mathrm{Var}_\theta(\hat\Theta_n) \leq a\exp(-bn)$ for
every integer $\theta$. (Hint: Use Mill's ratio from Problem 24(b) on page 470.)

32. Return to the situation in Problem 16 on page 140. Find the MLE of $\Theta$
and its nondegenerate asymptotic distribution.

33. Consider Example 7.60 (page 420) once again. This time, assume that we
observe $k > 2$ observations with conditional mean $\mu_i$ for every $i$. Find
the MLE of $\Sigma^2$ and what it converges to in probability.






34. Prove Proposition 7.71 on page 425. 

35. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID given
$\Theta = \theta$ with the following discrete logistic conditional density with
respect to counting measure on the integers:

(a) Find a $\sqrt{n}$-consistent estimator of $\Theta$.

(b) Find an explicit form for an asymptotically efficient, asymptotically
normal estimator of $\Theta$ based on $X_1, \ldots, X_n$.

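36. Let

$$\psi(x, \theta) = \begin{cases} -1 & \text{if } x < \theta - 1, \\
\sin\left[\frac{\pi}{2}(x - \theta)\right] & \text{if } \theta - 1 \leq x \leq \theta + 1, \\
1 & \text{if } x > \theta + 1. \end{cases}$$

(a) Prove that there is always a solution to
$\sum_{i=1}^n \psi(X_i, \theta) = 0$.

(b) Assume that the $X_i$ are conditionally IID $N(\theta, 1)$ given
$\Theta = \theta$. Now prove that for each $\epsilon > 0$,

$$\lim_{n\to\infty} P_{\theta_0}\left(\exists \text{ a solution to } \sum_{i=1}^n \psi(X_i, \theta) = 0 \text{ in } [\theta_0 - \epsilon, \theta_0 + \epsilon]\right) = 1.$$

37. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID given
$\Theta = \theta$ with $U(-\theta, \theta)$ distribution and
$\Omega = (0, \infty)$. Find the MLE $\hat\Theta_n$ of $\Theta$ based on $n$
observations, and find $a_n$ so that $a_n(\hat\Theta_n - \theta)$ has
nondegenerate asymptotic distribution given $\Theta = \theta$. Also find that
distribution.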

38. Suppose that $n$ arrows are fired at a circular target of radius $a$ whose
center is at the point $(0, 0, 0) \in \mathbb{R}^3$. The target lies in the
plane where the third coordinate is 0. Suppose that arrow $i$ passes through the
point $(X_i, Y_i, 0)$. Let $R_i = \sqrt{X_i^2 + Y_i^2}$. Suppose that
$(X_i, Y_i)$ are conditionally IID $N_2(0, \theta I/2)$ given
$\Theta = \theta$. The data we observe are all $(X_i, Y_i)$ pairs for those
arrows that hit the target. We also know $n$.

(a) Find the distribution of $R_i^2/\theta$ for an arbitrary arrow (whether or
not it will hit the target).

(b) Find the conditional probability $P_\theta(R_i \leq a)$ that arrow $i$ hits
the target.

(c) Find the MLE $\hat\Theta_n$ of $\Theta$.

(d) Find the asymptotic distribution of $\hat\Theta_n$ as $n \to \infty$.

39.*Suppose that $Y_1, Y_2$ are conditionally independent with
$Y_i \sim Bin(n_i, p_i)$ given $P_1 = p_1, P_2 = p_2$, where $n_1$ and $n_2$ are
known sample sizes. The parameter space is
$\Omega = \{(p_1, p_2) : p_2 \geq p_1\}$.

(a) Find the MLE $(\hat{P}_1, \hat{P}_2)$ of $(P_1, P_2)$.

(b) Find the asymptotic distributions of $\hat{P}_1$ and of $\hat{P}_2$ as
$n_1$ and $n_2$ go to infinity. (Hint: Consider the case $p_1 = p_2$ separate
from the case $p_1 < p_2$.)






40. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with $N(\theta, 1)$
distribution given $\Theta = \theta$. Let $\tilde\Theta_n$ be any estimator
whatsoever of $\Theta$. Let $\psi$ be the derivative (with respect to $\theta$)
of the log of the conditional density of each $X_i$ given $\Theta = \theta$.
Find the asymptotically efficient estimator of Theorem 7.75.

41.*Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $U(0, \theta)$
random variables given $\Theta = \theta$. The parameter space is the interval
$[1, 2]$. Suppose that we only get to observe $Y_i = I_{[0,1]}(X_i)$ for each
$i$. That is, we only see whether or not each observation is between 0 and 1.

(a) Find the MLE of $\Theta$ based on $Y_1, \ldots, Y_n$.

(b) Find the asymptotic (as $n \to \infty$) distribution of the MLE found
above.

(c) In terms of asymptotic efficiency, how does the MLE found above compare to
the MLE based on observing the actual $X_i$ values?

42. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID given
$\Theta = \theta$ with conditional density

0 OL ot 

where $\alpha$ is known and the parameter space is $\Omega = (0, \infty)$.

(a) Find the MLE $\hat\Theta_n$ of $\Theta$ based on $X_1, \ldots, X_n$.

(b) Prove that $\hat\Theta_n$ is inadmissible if the loss is squared error.

(c) Find $a_n$ and $b_n$ such that $a_n \hat\Theta_n + b_n$ has nondegenerate
asymptotic distribution, and find that distribution.

43. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID given $\Theta$, a
one-dimensional parameter. Assume the conditions of Theorem 7.63 and that there
are no superefficient estimators. Let $\hat\Theta_n$ be the MLE of $\Theta$, and
let $\{T_n\}_{n=1}^\infty$ be another sequence of estimators with
$\sqrt{n}(T_n - \theta) \stackrel{\mathcal{D}}{\to} N(0, v(\theta))$ given
$\Theta = \theta$ for all $\theta$ and $T_n$ a function of
$(X_1, \ldots, X_n)$. Consider the joint asymptotic distribution of
$\sqrt{n}([\hat\Theta_n, T_n]^\top - \theta\mathbf{1})$ given $\Theta = \theta$.
Prove that the asymptotic covariance is $1/\mathcal{I}_{X_1}(\theta)$. (Hint:
Look at the proof of Theorem 5.9 on page 298.)

44.*A psychologist is studying paired subjects. Each person in each pair is
asked a yes-no question. Let $X_{i,j} = 1$ if person $i$ in pair $j$ answers yes
($X_{i,j} = 0$ otherwise) for $i = 1, 2$ and $j = 1, 2, \ldots$. The
psychologist wants to assume that there are parameters $\Theta$ such that all of
the $X_{i,j}$ are conditionally independent given $\Theta$. Suppose that the
psychologist believes that there is a number $\alpha$ such that

$$\frac{\Pr(X_{2,j} = 1|\Theta = \theta)}{\Pr(X_{2,j} = 0|\Theta = \theta)}
= \alpha\, \frac{\Pr(X_{1,j} = 1|\Theta = \theta)}{\Pr(X_{1,j} = 0|\Theta = \theta)}.$$

(a) Prove that there exist numbers $\beta_1, \beta_2, \ldots$ such that

$$\Pr(X_{1,j} = x, X_{2,j} = y|\Theta = \theta)
= \frac{(\alpha\beta_j)^y}{1 + \alpha\beta_j}\, \frac{\beta_j^x}{1 + \beta_j}. \qquad (7.138)$$

(b) The psychologist decides to let $\Theta = (A, B_1, B_2, \ldots)$ so that
(7.138) gives the conditional probability of observing $(x, y)$ for pair $j$
given $\Theta = (\alpha, \beta_1, \beta_2, \ldots)$. Observations are then made
for $j = 1, \ldots, m$. Let $Z_j = X_{1,j} + X_{2,j}$,
$T = \#\{j : Z_j = 1\}$, and
$S = \sum_{j=1}^m X_{2,j} I_{\{1\}}(Z_j) = \sum_{j : Z_j = 1} X_{2,j}$. Write the
likelihood function.

(c) Find the MLEs of $A$ and $B_1, \ldots, B_m$. (Hint: First, find the MLEs of
the $B_i$ for fixed value of $A$, and then find the MLE of $A$.)

(d) Since the MLE of $A$ depends only on the pairs with $Z_j = 1$, find the
conditional density of the data given $(Z_1, \ldots, Z_m)$. Also find the
conditional MLE of $A$ based on this distribution.

(e) Show that the conditional MLE found in part (d) is consistent as
$m \to \infty$.

45. Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID with
$N(\mu, \sigma^2)$ distribution given $\Theta = (\mu, \sigma)$. (See the end of
Example 7.52 on page 417.) Prove that
for every 6 0 and every compact set C C Q, Ee 0 Z(C c y Xi) < 0 in the 
notation of Theorem 7.49. 

Section 7.4:

46. Use the problem description in Problem 16 on page 75. Show that the 
posterior distribution of M given (Xi, . . . , X n ) is not consistent in the sense 
of Theorem 7.78. 

47. Assume the conditions of Theorem 7.78. Prove that there exists a subset
$A \subseteq \Omega$ with $\mu_\Theta(A) = 1$ such that for every
$\theta \in A$,



48. Return to the situation in Example 7.82 on page 432. Prove that the
posterior probability of $C_\epsilon$ does not almost surely converge to 1 given
$\Theta = \theta_0$. (Hint: Rewrite the posterior probability of
$(0, \theta_0)$ in terms of the random variable $n(\theta_0 - X_{(n)})$, where
$X_{(n)}$ is the largest observation. Then use the result in Example 7.33 on
page 408.)

49. Suppose that $X_1, \ldots, X_n$ are conditionally IID with exponential
distribution $Exp(\theta)$ given $\Theta = \theta$. Let $\Theta$ have a
$\Gamma(a, b)$ prior distribution. Use Laplace's method to construct a formula
for the approximation to the posterior mean of $\Theta$. How does this compare
to the exact posterior mean?

50. Suppose that $X_1, \ldots, X_n$ are conditionally IID with Laplace
distribution $Lap(\theta, 1)$ given $\Theta = \theta$. Let the prior
distribution of $\Theta$ be $Lap(0, 1)$. We wish to approximate the predictive
density of a future observation, namely
$\int [\exp(-|x - \theta|)/2]\, f_{\Theta|X}(\theta|x)\, d\theta$ for various
values of $x$.

(a) Use Laplace's method to construct a formula for the approximation. 

(b) Describe how to use importance sampling to do the approximation. 







51. Let $\Theta = (\Gamma, \Psi)$. Suppose that one wishes to test the
hypothesis $H : \Gamma = \gamma_0$. Let the prior probability of $H$ be
positive, and suppose that the prior for $\Psi$ given that $H$ is true is the
conditional prior of $\Psi$ given $\Gamma = \gamma_0$ calculated from the prior
on $\Theta$ given that $H$ is false. Prove that the approximate Bayes factor in
(4.27) is the same as the Laplace approximation of Theorem 7.116 divided by
$f_\Gamma(\gamma_0)$ when the hypothesis is $H : \Gamma = \gamma_0$.

52. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID given
$(P_1, P_2) = (p_1, p_2)$ with $Ber(p_1 + p_2)$ distribution. Let the prior
distribution be $(P_1, P_2) \sim Dir_3(\alpha_1, \alpha_2, \alpha_3)$.

(a) Find the posterior distribution of $P_1$ given $(X_1, \ldots, X_n)$.

(b) Conditional on $(P_1, P_2) = (p_1, p_2)$, say what happens to the posterior
distribution found in part (a) as $n \to \infty$.

Section 7.5: 

53. Let the parameter be $\Theta = (M_1, M_2, \Sigma)$, and suppose that
conditional on $\Theta = (\mu_1, \mu_2, \sigma)$,
$X_{1,1}, \ldots, X_{1,n_1}, X_{2,1}, \ldots, X_{2,n_2}$ are independent with
$X_{i,j}$ having $N(\mu_i, \sigma^2)$ distribution for $j = 1, \ldots, n_i$ and
$i = 1, 2$. Prove that the size $\alpha$ likelihood ratio test of
$H : M_1 = M_2$ versus $A : M_1 \neq M_2$ is also the UMPU level $\alpha$ test.

54.*Let $\alpha$ be a known number strictly between 0 and 1/2. The parameter
space is
$\Omega = \{(\theta_1, \theta_2) : 0 \leq \theta_1 \leq \alpha, 0 \leq \theta_2 \leq 1\}$.
Let $D$ be a discrete random variable with conditional density given
$(\Theta_1, \Theta_2) = (\theta_1, \theta_2)$



The hypothesis of interest is $H : \Theta_1 = \alpha, \Theta_2 = 1/2$ versus
$A : \Theta_1 < \alpha, \Theta_2 \neq 1/2$. We will observe only one $D$ value,
and $\alpha$ will be the level for every test below.

(a) Find the likelihood ratio test and its power function.

(b) Consider the group consisting of two transformations $g^+ D = D$ and
$g^- D = -D$. Show that the testing problem is invariant under the action of
this group.

(c) Show that the likelihood ratio test is invariant. 

(d) Find a uniformly most powerful invariant test and compare its power 
function to that of the likelihood ratio test. 

(e) Find the least powerful invariant test. 

55. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. Let $\Omega_H$
be the set of rational numbers, and let $\Omega_A$ be the set of irrational
numbers. Prove that the likelihood ratio test with level $\alpha$ of
$H : \Theta \in \Omega_H$ versus $A : \Theta \in \Omega_A$ is the trivial test
$\phi(x) = \alpha$.



f 0i(l-0 2 ) 



ifd= -2, 

if d= -1, 

if d = 0, 
if d= 1, 

if d = 2. 



/D|e 1 ,e 2 (d|«i,^)= < 




, O1O2 



Chapter 8 

Hierarchical Models 



When a model has many parameters, it may be the case that we can con- 
sider them as a sample from some distribution. In this way we model the 
parameters with another set of parameters and build a model with different 
levels of hierarchy. In this chapter we will discuss situations in which it is 
natural to model in this way. 

8.1 Introduction 

8.1.1 General Hierarchical Models 

We turn our attention now to a situation in which the observations are 
not exchangeable. Suppose, for example, that several treatments are be- 
ing administered in a clinical trial. From each treatment group, we will 
make some observations. It may be plausible to model the observations 
within each treatment group as exchangeable, but it would seem strange 
to model all observations as exchangeable. For each treatment group, we 
might develop a parametric model as we have done elsewhere in this text. A 
hierarchical model for this example involves treating the set of parameters 
corresponding to the different treatment groups as a sample from another 
population. Prior to seeing any observations, we can model the parameters
as exchangeable.$^1$ This would mean that we could introduce another set of
parameters to model their joint distribution. These second-level parameters
are called hyperparameters. We would then need to specify a distribution
for these hyperparameters. Here are some examples.

$^1$ It is not essential that we model the parameters as exchangeable a priori,
but it is mathematically convenient. Such a model corresponds to treating the
different groups symmetrically prior to observing any data.

Example 8.1. Suppose that there are $k$ treatment groups. Let $X_{ij}$ stand for
the observed response of subject $j$ in treatment group $i$. We might invent
parameters $M_1, \ldots, M_k$ and model the $X_{ij}$ as conditionally
independent given $(M_1, \ldots, M_k) = (\mu_1, \ldots, \mu_k)$ with
$X_{ij} \sim N(\mu_i, 1)$. We might then model $M_1, \ldots, M_k$ as a priori
exchangeable with $N(\theta, 1)$ distribution given $\Theta = \theta$. Here,
$\Theta$ is a hyperparameter. We should also specify a distribution for
$\Theta$.$^2$ Note that we have only one $\Theta$ regardless of what $k$ is.
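To make Example 8.1 concrete, here is a minimal sketch of the posterior calculation under one added assumption: the hyperparameter has a normal prior with mean `m0` and variance `v0`. The example leaves this distribution unspecified, so `m0`, `v0`, the function name, and the data are all hypothetical. With every variance equal to 1, the group averages are sufficient, and all posterior distributions are normal.

```python
def hierarchical_normal_posterior(group_means, group_sizes, m0=0.0, v0=1.0):
    """Posterior mean and variance of the hyperparameter, and posterior means
    of the group parameters, when all conditional variances equal 1.

    Assumes (hypothetically) a N(m0, v0) prior for the hyperparameter.
    """
    # With M_i integrated out, the average of group i is N(theta, 1 + 1/n_i).
    weights = [1.0 / (1.0 + 1.0 / n) for n in group_sizes]
    post_prec = 1.0 / v0 + sum(weights)
    theta_mean = (m0 / v0 + sum(w * x for w, x in zip(weights, group_means))) / post_prec
    theta_var = 1.0 / post_prec
    # Given theta, M_i has posterior mean (n_i * xbar_i + theta) / (n_i + 1);
    # averaging over the posterior of theta replaces theta by its posterior mean.
    mu_means = [(n * x + theta_mean) / (n + 1)
                for n, x in zip(group_sizes, group_means)]
    return theta_mean, theta_var, mu_means

theta_mean, theta_var, mu_means = hierarchical_normal_posterior(
    group_means=[1.0, -1.0], group_sizes=[4, 4])
```

The posterior mean of each group parameter is a weighted average of that group's own sample average and the posterior mean of the hyperparameter, and the latter depends on the data from every group.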

Example 8.2. A survey is conducted in three different cities. Each person
surveyed is asked a yes-no question. Treat "yes" as $X = 1$ and "no" as $X = 0$.
Then, to each person $j$ in city $i$, there corresponds a Bernoulli random
variable $X_{ij}$. It might seem plausible to treat the $X_{ij}$ observations
from a single city $i$ as exchangeable. Suppose that we invent three parameters
$P_1, P_2, P_3$. Then we can model the $X_{ij}$ for fixed $i$ as conditionally
IID $Ber(p)$ given $P_i = p$. We would then need to construct a joint
distribution for $(P_1, P_2, P_3)$. For instance, we could model the $P_i$ as
exchangeable with $Beta(\alpha, \beta)$ distribution conditional on
$A = \alpha$ and $B = \beta$. Here, $A$ and $B$ are the hyperparameters. We
would then need a joint distribution for $(A, B)$. Note that we only use a
single pair $(A, B)$ no matter how many $P_i$ we have in this simple model.

The intuitive concept of how hierarchical models work is the following. 
Suppose that the data comprise several groups, each of which we consider 
to be a collection of exchangeable random variables. From the data in each
group, we obtain direct information about the corresponding parameters. 
Thinking of the hyperparameters as known for the time being, we then 
update the distributions of the parameters using the data, to get posterior 
distributions for the parameters via Bayes' theorem 1.31. Future data (in 
each group) are still exchangeable with the same conditional distributions 
given the parameters, but the distributions of the parameters have changed. 
In fact, the distribution of each parameter (given the hyperparameters) has 
now been updated using only the data from its corresponding group. Hence 
the parameters are no longer exchangeable. Now, we can also update the 
distribution of the hyperparameters. To do this, we first find the conditional 
distribution of the data given the hyperparameters. Then, we can use Bayes' 
theorem 1.31 again to find the posterior distribution of the hyperparameters 
given the data. The marginal posterior of the parameters given the data 
is found by integrating the hyperparameters out of the joint posterior of 
the parameters and hyperparameters. This is how the data from all groups 
combine to provide information about all of the parameters, not just the 
ones corresponding to their own group. It is the common dependence of all 
parameters on the hyperparameters that allows us to make use of common 
information in updating the distributions of all parameters. A diagram of
the directions of influence is given in Figure 8.3. A random variable at the
pointed end of an arrow has its conditional distribution calculated given
the random variable at the other end of the arrow. Double-headed arrows
indicate places where Bayes' theorem 1.31 is used. Future data are included
in the diagram to indicate how observed data in all groups affect predictive
distributions of all future data through the hyperparameters.

[Figure 8.3. Schematic Diagram of Hierarchical Model. Nodes: Hyperparameters;
Parameters$_1$, Data$_1$, Future$_1$; ...; Parameters$_k$, Data$_k$, Future$_k$.]

$^2$ We set all variances equal to 1 for simplicity in this first example. In
any real application, the variances would be unknown parameters as well.

In theory, the updating can be performed as follows. We can denote the
data to be observed as $X$, the parameters as $\Theta$, the hyperparameters as
$\Psi$, and future data as $Y$. Assume that $X$ and $Y$ are conditionally
independent given $\Theta$ and $\Psi$. Then the conditional posterior density of
the parameters given the hyperparameters is

$$f_{\Theta|X,\Psi}(\theta|x,\psi) = \frac{f_{X|\Theta,\Psi}(x|\theta,\psi)\, f_{\Theta|\Psi}(\theta|\psi)}{f_{X|\Psi}(x|\psi)},$$

where the density of the data given the hyperparameters alone is

$$f_{X|\Psi}(x|\psi) = \int f_{X|\Theta,\Psi}(x|\theta,\psi)\, f_{\Theta|\Psi}(\theta|\psi)\, d\theta.$$

The marginal posterior distributions of the parameters can be found from

$$f_{\Theta|X}(\theta|x) = \int f_{\Theta|X,\Psi}(\theta|x,\psi)\, f_{\Psi|X}(\psi|x)\, d\psi,$$

where the posterior density of $\Psi$ given $X = x$ is

$$f_{\Psi|X}(\psi|x) = \frac{f_{X|\Psi}(x|\psi)\, f_\Psi(\psi)}{f_X(x)},$$

and the marginal density of the $X$ is

$$f_X(x) = \int f_{X|\Psi}(x|\psi)\, f_\Psi(\psi)\, d\psi.$$

Finally, the predictive distribution of future data $Y$ is found from the
posterior density of the parameters:

$$f_{Y|X}(y|x) = \int f_{Y|\Theta}(y|\theta)\, f_{\Theta|X}(\theta|x)\, d\theta.$$


Hierarchical models were first popularized by Lindley and Smith (1972) 
and Smith (1973) for the special cases of multivariate normal observations. 
Hierarchical models are special cases of partial exchangeability, which we 
consider in more detail in the remainder of this section.

8.1.2 Partial Exchangeability* 

A natural generalization of both hierarchical models and exchangeability 
is the concept of partial exchangeability. There are several types of partial 
exchangeability. Diaconis and Freedman (1980b) present a good overview 
with some examples. 

There are two ways to think about what exchangeability means, and each
of them leads to a way of extending the concept to partial exchangeability.
Let $X_1, X_2, \ldots$ be a (possibly finite) sequence of random quantities.

1. They are exchangeable if, for each $n$, every permutation of $n$ of them
has the same joint distribution as every other permutation of $n$ of them.

2. They are exchangeable if, for each $n$, each sequence $\{z_i\}_{i=1}^n$ of
$n$ possible outcomes has the same probability of being the observed value of
$\{X_i\}_{i=1}^n$ as every permutation of $\{z_i\}_{i=1}^n$.

The second description has the drawback that it only makes sense, as
stated, for discrete random variables. There are ways to make it precise
for more general cases, but they lose some intuitive appeal in the
translation. Oddly enough, it is this second description that has the greatest
potential for generalizing the concept of exchangeability.

Based on the first description, we have the following restrictive extension
of exchangeability.

Definition 8.4. A sequence $X_1, X_2, \ldots$ is marginally partially
exchangeable if it can be partitioned into subsequences
$X^k_1, X^k_2, \ldots$, for $k = 1, 2, \ldots$ such that the random quantities
in each subsequence are exchangeable. To which subsequence each $X_i$ belongs
must be known in advance, and the subsequences (as well as the original
sequence) may be finite or infinite.

*This section may be skipped without interrupting the flow of ideas.




If the subsequences are infinite, DeFinetti's representation theorem 1.49 
can be applied to each of the subsequences to conclude that there must 
exist probability measures corresponding to each subsequence (with some 
unspecified joint distribution) such that the random variables in each subse- 
quence are conditionally independent given their corresponding probability 
measures. Similarly, we can introduce finite-dimensional parametric fam- 
ilies for each subsequence and hence reduce the problem of finding joint 
distributions for the probability measures to finite-dimensional joint distri- 
butions. This is basically what hierarchical models do. (See Examples 8.1 
and 8.2 on page 477.) 

As an example of an attempt to extend the second description of
exchangeability, consider a Markov chain $\{X_n\}_{n=1}^\infty$ of Bernoulli
random variables. Clearly, not every permutation of a sequence of possible
outcomes
has the same probability. But there are some permutations that do have 
the same probability. In particular, two sequences with the same first value 
and the same numbers of the four types of transitions (0 to 1, 0 to 0, 1 to 
1, and 1 to 0) will have the same probability. For example: 

(1,1,0,1,0,0,0,1,0,0,1,1) and (1,1,1,0,0,0,0,1,0,1,0,1). 

This is certainly not a case of marginal partial exchangeability. It does, 
however, have the intuitive appearance of being a generalization of the sec- 
ond description of exchangeability. Diaconis and Freedman (1980c) give an 
in-depth treatment of the type of partial exchangeability that characterizes 
Markov chains. 
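The claim about the two displayed sequences is easy to check directly. The sketch below computes the first value and the four transition counts for each sequence and, for one arbitrarily chosen initial distribution and transition matrix (hypothetical numbers, not from the text), confirms that the two sequences receive the same probability.

```python
import math
from collections import Counter

def markov_summary(seq):
    """First value together with the counts of each (from, to) transition."""
    return seq[0], Counter(zip(seq, seq[1:]))

def chain_probability(seq, initial, transition):
    """Probability of observing seq under a stationary-transition Markov chain."""
    prob = initial[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= transition[a][b]
    return prob

seq_a = (1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1)
seq_b = (1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1)

initial = {0: 0.4, 1: 0.6}                                   # hypothetical
transition = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # hypothetical

same_summary = markov_summary(seq_a) == markov_summary(seq_b)
probabilities_match = math.isclose(
    chain_probability(seq_a, initial, transition),
    chain_probability(seq_b, initial, transition))
```

Both sequences start at 1 and contain two 1-to-1 transitions and three each of the other types, so their probabilities agree for every choice of initial distribution and transition matrix, not only the one used here.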

The most general type of partial exchangeability is described by Diaconis 
and Freedman (1984). In fact, it is so general that it is satisfied by arbi- 
trary joint distributions. (See Example 2.118 on page 129.) Nevertheless, 
each specific instance of partial exchangeability leads to a representation 
theorem of the type of Theorem 2.111. In Example 2.116 on page 127 we 
saw that Theorem 2.111 contains a reformulation of DeFinetti's represen- 
tation theorem 1.49 as a special case. In Section 8.1.3, we give examples of 
Theorem 2.111 for partially exchangeable random quantities that are not 
exchangeable. 

8.1.3 Examples of the Representation Theorem* 

In this section, we give examples of applying the representation theorem 2.111
to cases of partially exchangeable random quantities.

*This section may be skipped without interrupting the flow of ideas.

Example 8.5. This example is the one-way analysis of variance with only two
groups and equal variances. To handle more groups is a simple matter, but the
notation gets in the way of an initial understanding. Let
$\mathcal{X}_n = \mathbb{R}^n$ and
$\mathcal{T}_n = \mathbb{R}^2 \times \mathbb{R}^{+0}$. Suppose that there is a
deterministic sequence $\{j_n\}_{n=1}^\infty$ with $j_n \in \{0, 1\}$ for all
$n$. The $j_n$ sequence tells us from which group the $n$th observation comes.
Let

$$T_n = \left(\sum_{i=1}^n j_i X_i,\ \sum_{i=1}^n (1 - j_i) X_i,\ \sum_{i=1}^n X_i^2\right).$$

Let $k_n = \sum_{i=1}^n j_i$, and let $r_n(\cdot, t)$ be the uniform
distribution on the surface of the sphere of radius
$\sqrt{t_3 - t_1^2/k_n - t_2^2/(n - k_n)}$ around the point whose $i$th
coordinate is $j_i t_1/k_n + (1 - j_i) t_2/(n - k_n)$. One can check that the
conditions of Theorem 2.111 are met.

We would like to proceed as in Example 2.117 on page 128, but we cannot assume
that the coordinates are IID in the limit distributions. So, we will construct
the joint distribution of $s_0$ observations from group 0 and $s_1$ observations
from group 1 (for fixed $s_0$ and $s_1$) given $T_n = t$, and see what happens
as $n \to \infty$. Call these observations $Z = (Z_1, \ldots, Z_{s_0})$ and
$W = (W_1, \ldots, W_{s_1})$. Let



Then 

fz,W\T n (z,w\t U t 2 ,t 3 ) = 



Since 



- 1 ft 1 * 2 1 ,2\ 1 . 1 , 

r (f) 

r(2=i§=^)(n7r) £fl T £1 aJl 0+ ' 1 
lim ±±1 — — 2~ 2 



n — an — si 



we have that /z,ty|Ty, is asymptotically equivalent to 

(2ir)- a ^ < r-Co+.i) A _ E^i(^-Mn) 2 _ E^^-^ N 

If (T n converges to a e (0, oo) and fj, n and v n converge to n and i/, respectively, this 
function converges uniformly on compact sets to the density of s 0 N{^^) and 
5i ) random variables all independent. If a 2 goes to oo, there is no limit 

distribution. If $\sigma_n$ goes to 0, and $\mu_n \to \mu$ and
$\nu_n \to \nu$, the function converges to 0 uniformly outside of every open
neighborhood of the point with $i$th coordinate $j_i \mu + (1 - j_i)\nu$. In
this case the limit distribution is point masses at $\mu$ and $\nu$ depending on
whether $j_i = 1$ or 0. Finally, if $\sigma_n$ goes to a finite value and either
$\mu_n$ or $\nu_n$ diverges to $\pm\infty$, there is no limit distribution. The
extreme distributions either have all coordinates degenerate or have the
coordinates being independent normal random variables with common variance. In
either case, there are two different means depending on the values of the
sequence $\{j_n\}_{n=1}^\infty$.

Lauritzen (1984, 1988) shows how to characterize much more general 
normal linear models using the representation of Theorem 2.111. See Prob- 
lem 1 on page 532 for an example of the characterization of a conditionally 
partially exchangeable sequence by means of Theorem 2.111. 



482 Chapter 8. Hierarchical Models 



Aldous (1981) introduces a special kind of partial exchangeability that 
arises in the two-way analysis of variance. 

Definition 8.6. Let $X = ((X_{i,j}))_{i,j=1}^{\infty}$ be an array of random variables. Let $R_i = (X_{i,1}, X_{i,2}, \ldots)$ and $C_j = (X_{1,j}, X_{2,j}, \ldots)$. We say that $X$ is row and column exchangeable if both $\{R_n\}_{n=1}^\infty$ and $\{C_n\}_{n=1}^\infty$ are exchangeable sequences.

Row and column exchangeability is a special case of the conditions of 
Theorem 2.111. 

Example 8.7. Let $X$ be a row and column exchangeable array. Let $\{(r_n, c_n)\}_{n=1}^\infty$ be a sequence of pairs of integers such that $r_{n+1} \ge r_n$ and $c_{n+1} \ge c_n$ with at least one inequality strict for each $n$. Let $\mathcal{X}_n$ be $\mathbb{R}^{r_n c_n}$, and let $X_n$ be the first $r_n$ rows and $c_n$ columns of $X$. That is, we add at least one row and/or at least one column each time we increase $n$.³ Let $T_n = (T_n^r, T_n^c)$, where $T_n^r$ and $T_n^c$ are defined as follows. For each row of $X_n$, construct the order statistic (smallest to largest) of the numbers in that row and then arrange these order statistics according to the smallest value in the row. Call the result $T_n^r$. Define $T_n^c$ by doing the same thing to the columns. For example, suppose that

$$X_n = \begin{pmatrix} -1 & 3 & 2 & 0 \\ 4 & -2 & 1 & 3 \\ 1 & 0 & -1 & 2 \end{pmatrix}.$$

Then

$$T_n^r = \begin{cases} (-2, 1, 3, 4), \\ (-1, 0, 2, 3), \\ (-1, 0, 1, 2), \end{cases} \qquad T_n^c = \begin{cases} (-2, 0, 3), \\ (-1, 1, 4), \\ (-1, 1, 2), \\ (0, 2, 3). \end{cases}$$

Basically, you throw away the information about which row and which column each number was in, but you keep the information about which other numbers were in the same row and column with each number. Each of the $r_n!\,c_n!$ matrices that can be obtained from $X_n$ by permuting the rows and then the columns will have the same value of $T_n$. Similarly, all of those $r_n!\,c_n!$ arrays can be constructed from $T_n$ by a somewhat more tedious algorithm. Clearly, $r_n(\cdot, t)$ must be uniform over those $r_n!\,c_n!$ arrays to preserve row and column exchangeability.

Finding all of the $Q(\cdot, x)$ distributions is no small task. Aldous (1981) proves the following result. An array $X$ is row and column exchangeable if and only if there exists a measurable function $f : [0, 1]^4 \to \mathbb{R}$ such that $X$ has the same distribution as $Y = ((Y_{i,j}))$, with $Y_{i,j} = f(M, A_i, B_j, C_{i,j})$, where $M, A_1, A_2, \ldots, B_1, B_2, \ldots, C_{1,1}, \ldots$ are all IID $U(0, 1)$ random variables.

³ Technically, there is a way to write Theorem 2.111 so that it applies to partially ordered sets like the set of all $(r, c)$ pairs, but the author thought that the proof of Theorem 2.111 was complicated enough without introducing this added level of mathematical detail.
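The construction of $T_n = (T_n^r, T_n^c)$ in Example 8.7 can be sketched in a few lines of Python. This is only an illustration: the function name `order_statistic_summary` is hypothetical, and ties in the smallest entry are broken by keeping the original order (Python's stable sort), which is an assumption chosen to reproduce the display in the example.

```python
def order_statistic_summary(matrix):
    """Return (T_r, T_c) for a rectangular list-of-lists matrix.

    Each row (column) is replaced by its order statistic, and the sorted
    rows (columns) are then arranged by their smallest entry.
    """
    # Order statistic of every row, then arrange by the smallest value.
    rows = [tuple(sorted(row)) for row in matrix]
    t_r = sorted(rows, key=lambda r: r[0])  # stable: ties keep original order (assumption)
    # Same treatment for the columns; zip(*matrix) transposes the matrix.
    cols = [tuple(sorted(col)) for col in zip(*matrix)]
    t_c = sorted(cols, key=lambda c: c[0])
    return t_r, t_c


x_n = [[-1, 3, 2, 0],
       [4, -2, 1, 3],
       [1, 0, -1, 2]]
t_r, t_c = order_statistic_summary(x_n)

# Permuting rows and columns changes neither collection of order statistics.
permuted = [[row[j] for j in (2, 0, 3, 1)] for row in (x_n[2], x_n[0], x_n[1])]
p_r, p_c = order_statistic_summary(permuted)
```

Comparing `sorted(t_r)` with `sorted(p_r)` (and likewise for the columns) shows the invariance under row and column permutations that the example relies on.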






8.2 Normal Linear Models 

A particularly simple case with which to work is that of linear models in 
which the observables are modeled as having normal distributions given 
parameters and the parameters are also modeled as jointly normal (except 
for the scale parameters). 



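8.2.1 One-Way ANOVA

Suppose that we will observe data from $k$ different treatment groups. Let the $j$th observation in the $i$th group be $X_{ij}$. Suppose that we model the $X_{ij}$ as conditionally independent $N(\mu_i, \sigma^2)$ random variables given $M = (\mu_1, \ldots, \mu_k)$ and $\Sigma = \sigma$ for $j = 1, \ldots, n_i$ and $i = 1, \ldots, k$. Next, suppose that we model $M = (M_1, \ldots, M_k)$ as a vector of IID $N(\psi, \tau^2)$ random variables given $\Psi = \psi$ and $T = \tau$. To be precise, we should have said that the $X_{ij}$ have $N(\mu_i, \sigma^2)$ distribution given $M = (\mu_1, \ldots, \mu_k)$, $\Sigma = \sigma$, $\Psi = \psi$, and $T = \tau$. Next, we model $\Psi$ as $N(\psi_0, \tau^2/\zeta_0)$ given $T = \tau$ and $\Sigma = \sigma$. Finally, we need a joint distribution for $(\Sigma, T)$, which will remain unspecified for now.

The joint distribution of all quantities can be summarized as in Table 8.9. Future observations have a distribution like the first stage. The posterior distribution has only the last three stages. Let $X$ stand for the entire data vector, and let $x$ be the observed value.

Table 8.9. Hierarchical Model for One-Way ANOVA

Stage            Density
Data             $(2\pi\sigma^2)^{-\frac{1}{2}\sum_{i=1}^k n_i} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^k \left[n_i(\bar{x}_i - \mu_i)^2 + (n_i - 1)s_i^2\right]\right\}$
Parameter        $(2\pi\tau^2)^{-k/2} \exp\left\{-\frac{1}{2\tau^2}\sum_{i=1}^k (\mu_i - \psi)^2\right\}$
Hyperparameter   $\sqrt{\zeta_0}\,(2\pi\tau^2)^{-1/2} \exp\left\{-\frac{\zeta_0}{2\tau^2}(\psi - \psi_0)^2\right\}$
Variance         $f_{\Sigma,T}(\sigma, \tau)$

Conditional on $\Sigma = \sigma$, $T = \tau$, and $\Psi = \psi$, the posterior of the $M_i$ is found from simple normal distribution updating. The $M_i$, given $\Sigma = \sigma$, $T = \tau$, and $\Psi = \psi$, are independent with $M_i$ having distribution

$$N\left(\mu_i(\psi, \sigma, \tau),\ \frac{\tau^2\sigma^2}{n_i\tau^2 + \sigma^2}\right), \qquad (8.8)$$

where

$$\mu_i(\psi, \sigma, \tau) = \frac{n_i\bar{x}_i\tau^2 + \psi\sigma^2}{n_i\tau^2 + \sigma^2}.$$

The conditional joint distribution of the data given $\Psi = \psi$, $\Sigma = \sigma$, and $T = \tau$, is that the $\bar{X}_i$ and $S_i^2$ are independent with⁴

$$\bar{X}_i \sim N\left(\psi,\ \frac{\sigma^2}{n_i} + \tau^2\right), \qquad S_i^2 \sim \frac{\sigma^2}{n_i - 1}\chi^2_{n_i - 1}. \qquad (8.10)$$

It follows that the posterior of $\Psi$ conditional on $\Sigma = \sigma$ and $T = \tau$ is

$$\Psi \mid X = x, \Sigma = \sigma, T = \tau \sim N\left(\psi_1(\sigma, \tau),\ \left[\frac{\zeta_0}{\tau^2} + \sum_{i=1}^k \frac{n_i}{\sigma^2 + \tau^2 n_i}\right]^{-1}\right), \qquad (8.11)$$

where

$$\psi_1(\sigma, \tau) = \frac{\dfrac{\zeta_0\psi_0}{\tau^2} + \sum_{i=1}^k \dfrac{n_i\bar{x}_i}{\sigma^2 + \tau^2 n_i}}{\dfrac{\zeta_0}{\tau^2} + \sum_{i=1}^k \dfrac{n_i}{\sigma^2 + \tau^2 n_i}}.$$

To find the posterior distribution of $(\Sigma, T)$, let $\bar{X}$ stand for the vector with coordinates $\bar{X}_1, \ldots, \bar{X}_k$. Then, the conditional distribution of the data given $\Sigma = \sigma$ and $T = \tau$ is

$$\bar{X} \sim N_k(\psi_0\mathbf{1}, W(\sigma, \tau)), \qquad S_i^2 \sim \frac{\sigma^2}{n_i - 1}\chi^2_{n_i - 1}, \qquad (8.12)$$

where

$$W(\sigma, \tau) = \begin{pmatrix} \frac{\sigma^2}{n_1} + \tau^2 + \frac{\tau^2}{\zeta_0} & \frac{\tau^2}{\zeta_0} & \cdots & \frac{\tau^2}{\zeta_0} \\ \frac{\tau^2}{\zeta_0} & \frac{\sigma^2}{n_2} + \tau^2 + \frac{\tau^2}{\zeta_0} & \cdots & \frac{\tau^2}{\zeta_0} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\tau^2}{\zeta_0} & \frac{\tau^2}{\zeta_0} & \cdots & \frac{\sigma^2}{n_k} + \tau^2 + \frac{\tau^2}{\zeta_0} \end{pmatrix}.$$

It follows that $f_{\Sigma,T|\bar{X},S_1^2,\ldots,S_k^2}(\sigma, \tau \mid \bar{x}, s_1^2, \ldots, s_k^2)$ is proportional to

$$f_{\Sigma,T}(\sigma, \tau)\,\sigma^{-(n_1 + \cdots + n_k - k)}\,|W(\sigma, \tau)|^{-1/2} \exp\left(-\sum_{i=1}^k \frac{(n_i - 1)s_i^2}{2\sigma^2} - \frac{1}{2}(\bar{x} - \psi_0\mathbf{1})^\top W^{-1}(\sigma, \tau)(\bar{x} - \psi_0\mathbf{1})\right).$$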
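The conditional updating steps above (the posterior of $\Psi$ given $\sigma$ and $\tau$, followed by the posterior of each $M_i$) are simple enough to sketch numerically. The following Python fragment is a minimal illustration, not the author's code: the function name `conditional_updates` and the particular values $\sigma = 6$, $\tau = 5$ are hypothetical, and the $M_i$ means are evaluated by plugging in $\psi = \psi_1(\sigma, \tau)$.

```python
def conditional_updates(n, xbar, psi0, zeta0, sigma, tau):
    """Normal updating for the one-way ANOVA hierarchy, given sigma and tau.

    Returns (psi1, var_psi, post_mean, post_var):
      psi1, var_psi : posterior mean and variance of Psi given sigma, tau
      post_mean[i]  : mean of M_i from (8.8), evaluated at psi = psi1
      post_var[i]   : variance of M_i from (8.8), conditional on Psi
    """
    s2, t2 = sigma ** 2, tau ** 2
    # Precision contributed by group i is n_i / (sigma^2 + tau^2 n_i).
    weights = [ni / (s2 + t2 * ni) for ni in n]
    precision = zeta0 / t2 + sum(weights)
    psi1 = (zeta0 * psi0 / t2
            + sum(w * xb for w, xb in zip(weights, xbar))) / precision
    post_mean = [(ni * xb * t2 + psi1 * s2) / (ni * t2 + s2)
                 for ni, xb in zip(n, xbar)]
    post_var = [t2 * s2 / (ni * t2 + s2) for ni in n]
    return psi1, 1.0 / precision, post_mean, post_var


# Hypothetical inputs: three groups, with assumed values of sigma and tau.
n = [10, 12, 15]
xbar = [27.9268, 18.1622, 19.5475]
psi1, var_psi, post_mean, post_var = conditional_updates(
    n, xbar, psi0=10.0, zeta0=0.1, sigma=6.0, tau=5.0)
```

Each entry of `post_mean` lies between the corresponding sample mean and `psi1`, which is the shrinkage that the rest of the section quantifies.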

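There is one special case in which the above formulas simplify tremendously. For fixed $\lambda$, suppose that

$$\Pr\left(T = \Sigma/\sqrt{\lambda}\right) = 1.$$

That is, $T$ is just a known scalar multiple of $\Sigma$. In this case, it is convenient to define $\gamma_0 = \lambda\zeta_0$ so that

$$\frac{\tau^2\sigma^2}{n_i\tau^2 + \sigma^2} = \frac{\sigma^2}{\lambda_i},$$

$$\mu_i(\psi, \sigma, \tau) = \frac{n_i\bar{x}_i + \psi\lambda}{\lambda_i} = \mu_i(\psi),$$

$$\psi_1(\sigma, \tau) = \frac{\sum_{i=1}^k \gamma_i\bar{x}_i + \gamma_0\psi_0}{\gamma_0 + \sum_{i=1}^k \gamma_i} = \psi_1,$$

$$\frac{\zeta_0}{\tau^2} + \sum_{i=1}^k \frac{n_i}{\sigma^2 + \tau^2 n_i} = \frac{1}{\sigma^2}\left(\gamma_0 + \sum_{i=1}^k \gamma_i\right),$$

$$W(\sigma, \tau) = \sigma^2 \begin{pmatrix} \frac{1}{n_1} + \frac{1}{\lambda} + \frac{1}{\gamma_0} & \frac{1}{\gamma_0} & \cdots & \frac{1}{\gamma_0} \\ \frac{1}{\gamma_0} & \frac{1}{n_2} + \frac{1}{\lambda} + \frac{1}{\gamma_0} & \cdots & \frac{1}{\gamma_0} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\gamma_0} & \frac{1}{\gamma_0} & \cdots & \frac{1}{n_k} + \frac{1}{\lambda} + \frac{1}{\gamma_0} \end{pmatrix},$$

where

$$\lambda_i = \lambda + n_i, \qquad \gamma_i = \frac{n_i\lambda}{\lambda_i}.$$

Note how $\mu_i$ and $\psi_1$ no longer depend on $\sigma$ and $\tau$. In fact, we can use Proposition 8.13 below to show that

$$|W(\sigma, \tau)| = \sigma^{2k}\left(\prod_{i=1}^k \gamma_i\right)^{-1}\left(1 + \frac{\gamma_\cdot}{\gamma_0}\right),$$

$$\sigma^2(\bar{x} - \psi_0\mathbf{1})^\top W^{-1}(\sigma, \tau)(\bar{x} - \psi_0\mathbf{1}) = \sum_{i=1}^k \gamma_i(\bar{x}_i - u)^2 + \frac{\gamma_0\gamma_\cdot}{\gamma_0 + \gamma_\cdot}(u - \psi_0)^2,$$

with $\gamma_\cdot = \sum_{i=1}^k \gamma_i$ and $u = \sum_{i=1}^k \gamma_i\bar{x}_i/\gamma_\cdot$.

⁴ The reduced model in (8.10) is sometimes called a variance components model. In this model, the vector $M$ is not of interest, but rather only the variance $T^2$ of its coordinates. The two terms involving $\tau$ and $\sigma$ are components of the variance of the observations. Hill (1965) gives a Bayesian analysis of such models.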

Proposition 8.13.⁵ Let $B$ be a positive definite $k \times k$ matrix, and let $x$ be a vector of dimension $k$. Define $A = B + cxx^\top$, then

$$A^{-1} = B^{-1} - \frac{c}{1 + cx^\top B^{-1}x}B^{-1}xx^\top B^{-1}$$

and $|A| = |B|(1 + cx^\top B^{-1}x)$.
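Both identities in Proposition 8.13 are easy to check numerically. The sketch below does so in plain Python for one arbitrary example; the matrix `B`, the vector `x`, the scalar `c`, and the helper names `solve` and `dot` are all hypothetical choices made for illustration, and the elimination routine is a generic one rather than anything taken from the text.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Returns (z, det), where det is the determinant of a.
    """
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    det = 1.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= f * m[col][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z, det


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Arbitrary positive definite B, vector x, scalar c, and right-hand side y.
B = [[2.0, 0.5, 0.0], [0.5, 3.0, 0.2], [0.0, 0.2, 1.5]]
x = [1.0, 2.0, -1.0]
c = 0.7
y = [0.3, -1.0, 2.0]
A = [[B[i][j] + c * x[i] * x[j] for j in range(3)] for i in range(3)]

binv_x, det_b = solve(B, x)
binv_y, _ = solve(B, y)
denom = 1.0 + c * dot(x, binv_x)

# A^{-1} y from the proposition, using only solves against B.
ainv_y_formula = [by - c * bx * dot(x, binv_y) / denom
                  for by, bx in zip(binv_y, binv_x)]
ainv_y_direct, det_a = solve(A, y)
det_a_formula = det_b * denom
```

The practical point is that once $B^{-1}$ is cheap (for instance, when $B$ is diagonal as in the one-way ANOVA special case), both $A^{-1}$ and $|A|$ come almost for free.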

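In this case, there is a simple conjugate prior for $\Sigma$ which makes the posterior of $\Sigma$ of the same form. That would be that $\Sigma^2$ has inverse gamma distribution $\Gamma^{-1}(a_0/2, b_0/2)$. The posterior of $\Sigma^2$ would be $\Gamma^{-1}(a_1/2, b_1/2)$, where $a_1 = a_0 + \sum_{i=1}^k n_i$, $\gamma_\cdot = \sum_{i=1}^k \gamma_i$, $u = \sum_{i=1}^k \gamma_i\bar{x}_i/\gamma_\cdot$, and

$$b_1 = b_0 + \sum_{i=1}^k (n_i - 1)s_i^2 + \sigma^2(\bar{x} - \psi_0\mathbf{1})^\top W^{-1}(\sigma, \tau)(\bar{x} - \psi_0\mathbf{1})$$

$$\phantom{b_1} = b_0 + \sum_{i=1}^k \left\{(n_i - 1)s_i^2 + \gamma_i(\bar{x}_i - u)^2\right\} + \frac{\gamma_0\gamma_\cdot}{\gamma_0 + \gamma_\cdot}(u - \psi_0)^2.$$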



5 This proposition is also used in the analysis of the two-way ANOVA in Sec- 
tion 8.2.2. 






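Posterior distributions for linear functions of location parameters are now $t$ distributions, as are predictive distributions of future observations. For example, if $Y$ is the average of $m$ future observations from population $i$, then some of the various posterior and predictive distributions are as follows, where the second argument of each $t$ distribution is the squared scale:

$$\Psi \sim t_{a_1}\left(\psi_1,\ \frac{b_1}{a_1}\,\frac{1}{\gamma_0 + \gamma_\cdot}\right),$$

$$M_i \sim t_{a_1}\left(\mu_i(\psi_1),\ \frac{b_1}{a_1}\left[\frac{1}{\lambda_i} + \frac{\lambda^2}{\lambda_i^2(\gamma_0 + \gamma_\cdot)}\right]\right),$$

$$Y \sim t_{a_1}\left(\mu_i(\psi_1),\ \frac{b_1}{a_1}\left[\frac{1}{m} + \frac{1}{\lambda_i} + \frac{\lambda^2}{\lambda_i^2(\gamma_0 + \gamma_\cdot)}\right]\right).$$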



When $T$ is not a known scalar multiple of $\Sigma$, there is still a way to simplify the formulas slightly. That is, introduce $\Lambda = \Sigma^2/T^2$ as a replacement for $T$ in the hyperparameter. Then, the simplified formulas are still correct as long as they are understood to represent conditional distributions given $\Lambda = \lambda$. In this case, it is also possible to let the values $a_0$, $b_0$, $\zeta_0$, and $\psi_0$ depend on $\lambda$, if one wishes. The posterior for $\Lambda$ is not particularly simple, but it is the only part of the posterior that is not simple. It is proportional to $f_\Lambda(\lambda)$ times

$$\frac{\Gamma\left(\frac{a_1}{2}\right)}{\Gamma\left(\frac{a_0}{2}\right)}\; b_0^{a_0/2}\left(\prod_{i=1}^k \gamma_i\right)^{1/2}\left(1 + \frac{\gamma_\cdot}{\gamma_0}\right)^{-1/2} b_1^{-a_1/2}.$$

Numerical integration is required to make any marginal (not conditional on $\Lambda$) or predictive inferences.

The model just described can be used to find a solution to the problem that gave rise to the James-Stein estimator in Section 3.2.3. In that problem $\Pr(\Sigma = 1) = 1$, but otherwise it is the same. Assuming $\Sigma$ to be unknown, the above model gives the posterior mean of $M_j$ to be

$$E(M_j \mid X = x) = \int_0^\infty \mu_j(\psi_1)\, f_{\Lambda|X}(\lambda \mid x)\,d\lambda.$$

Since the integration on the right-hand side is over $\lambda$, we should see explicitly where $\lambda$ is. So, we rewrite the formula as

$$E(M_j \mid X = x) = \int_0^\infty \frac{n_j\bar{x}_j + \lambda\psi_1(\lambda)}{\lambda + n_j}\, f_{\Lambda|X}(\lambda \mid x)\,d\lambda = E[\alpha_j(\Lambda) \mid X = x]\,\bar{x}_j + E[\{1 - \alpha_j(\Lambda)\}v(\Lambda) \mid X = x],$$

where $\alpha_j(\lambda) = n_j/(n_j + \lambda)$ and $v(\lambda) = \psi_1(\sigma, \tau)$ is itself a weighted average of $\psi_0$ and a weighted average of all of the sample averages. This is similar to the empirical Bayes modification to the James-Stein estimator, namely (3.55) on page 165. The way it behaves can be understood as follows. $\Lambda$ is a measure of how much more spread there is within each group relative to the spread between groups. Now, suppose that all $n_i = 1$. Then $v(\lambda)$ is $k/(1 + \lambda)$ times $\bar{x}$ plus $\zeta_0$ times $\psi_0$, all divided by $k/(1 + \lambda) + \zeta_0$. If the posterior distribution of $\Lambda$ is concentrated near 0 (that is, there is far less variation within groups than between), then $v(\Lambda)$ will get very little weight, since $\alpha_j(\Lambda)$ will be close to 1. This makes sense because the large spread between the means suggests that the information from $\bar{x}_j$ is much more valuable than the other $\bar{x}_i$s. If, however, $\Lambda$ has lots of mass for large values, then there will be a great deal of shrinkage, and $v(\Lambda)$ will be near $\psi_0$. For distributions of $\Lambda$ concentrated on intermediate values, $\bar{x}_j$, $\bar{x}$, and $\psi_0$ all receive moderate weight.
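The behavior of the weights $\alpha_j(\lambda)$ and of $v(\lambda)$ described above can be explored with a short Python sketch. Everything concrete here is an assumption made for illustration: the function name `shrinkage`, the sample averages, and the values of $\psi_0$ and $\zeta_0$ are hypothetical. The code evaluates the conditional mean $\alpha_j(\lambda)\bar{x}_j + \{1 - \alpha_j(\lambda)\}v(\lambda)$ for a fixed $\lambda$.

```python
def shrinkage(lam, n, xbar, psi0, zeta0):
    """Return (alpha, v, cond_mean) for a fixed value lam of Lambda."""
    gamma = [ni * lam / (ni + lam) for ni in n]   # gamma_i = n_i lam / (n_i + lam)
    gamma0 = lam * zeta0                          # gamma_0 = lam * zeta_0
    v = (gamma0 * psi0 + sum(g * xb for g, xb in zip(gamma, xbar))) \
        / (gamma0 + sum(gamma))
    alpha = [ni / (ni + lam) for ni in n]         # weight on the group's own average
    cond_mean = [a * xb + (1.0 - a) * v for a, xb in zip(alpha, xbar)]
    return alpha, v, cond_mean


# Hypothetical setting with all n_i = 1, as in the discussion above.
n = [1, 1, 1]
xbar = [3.0, 5.0, 10.0]
psi0, zeta0 = 1.0, 0.5
k = len(n)
grand = sum(xbar) / k

lam = 2.0
alpha, v, cond_mean = shrinkage(lam, n, xbar, psi0, zeta0)
# Closed form for v(lam) when all n_i = 1.
v_closed = (k / (1.0 + lam) * grand + zeta0 * psi0) / (k / (1.0 + lam) + zeta0)

_, _, mean_small = shrinkage(1e-9, n, xbar, psi0, zeta0)   # almost no shrinkage
_, _, mean_large = shrinkage(1e9, n, xbar, psi0, zeta0)    # almost complete shrinkage
```

With `lam` near 0 the conditional means essentially equal the sample averages, and with `lam` very large they collapse to `psi0`, matching the two extremes discussed in the text.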

Example 8.14. Consider the following data gathered from three groups:

$i$            1          2          3
$n_i$          10         12         15
$\bar{x}_i$    27.9268    18.1622    19.5475
$s_i^2$        23.8227    57.6736    32.3858

Suppose that we want to have $\Sigma^2$ and $T^2$ be independent in the prior distribution. Suppose that the prior for $\Sigma^2$ is $\Gamma^{-1}(a_0'/2, b_0'/2)$ and the prior for $T^2$ is $\Gamma^{-1}(c_0/2, d_0/2)$. Then $\Lambda$ has the distribution of $b_0'c_0/(a_0'd_0)$ times an $F_{c_0, a_0'}$ random variable. The conditional distribution of $\Sigma^2$ given $\Lambda = \lambda$ can be shown (see Problem 11 on page 534) to be $\Gamma^{-1}(a_0, b_0(\lambda))$, where $a_0 = [a_0' + c_0]/2$ and $b_0(\lambda) = [b_0' + \lambda d_0]/2$. Suppose that the rest of the prior distribution is specified by $\psi_0 = 10$, $\zeta_0 = 0.1$, $a_0' = 1$, $b_0' = 10$, $c_0 = 1$, and $d_0 = 1$. The posterior distribution of $\Lambda$ can be found approximately: its mode is around 1.07, and it has probability of about 0.94 of $\Lambda \lt 10$. Hence $\alpha_j(\Lambda) = n_j/[\Lambda + n_j]$ is close to 1 with high probability for all $j$, and there will be little shrinkage toward the overall mean.

We can numerically calculate the posterior distributions of the three $M_i$ using either Laplace's method (Section 7.4.3) or importance sampling (Section B.7). For Laplace's method, the "$\Theta$" is $\Lambda$, and the function $g(\lambda)$ is one of the posterior densities of the $M_i$ given $\Lambda = \lambda$ evaluated at various values of $\mu_i$. These densities are $t_{a_1}$ with location and scale

$$\frac{n_i\bar{x}_i + \psi_1(\lambda)\lambda}{\lambda + n_i} \quad\text{and}\quad \sqrt{\frac{b_1(\lambda)}{a_1}}\sqrt{\frac{1}{\lambda + n_i} + \frac{\lambda^2}{(\lambda + n_i)^2}\,\frac{1}{\zeta_0\lambda + \sum_{j=1}^k \gamma_j}},$$

where $\psi_1(\lambda) = (\zeta_0\lambda\psi_0 + \sum_{i=1}^k \gamma_i\bar{x}_i)/(\zeta_0\lambda + \gamma_\cdot)$, and $b_1(\lambda)$ is the same as $b_1$ with $b_0$ replaced by $b_0(\lambda)$.

For importance sampling, we sampled 1000 values from the prior distribution of $\Lambda$ and used these to approximate the integrals that equal the posterior densities at various $\mu_i$ values. We also used the delta method to calculate standard deviations for the density values and found these to be at most 0.09 times the density values in all cases (less than 0.05 times the density in 80% of the cases).

The three posterior means were calculated from the posterior densities and were found to equal 26.08, 18.86, and 19.86. We see that some shrinkage has occurred. The numerically evaluated densities are shown in Figure 8.15 together with the results of an empirical Bayes analysis to be described in Section 8.4 and a successive substitution sampling analysis to be described in Section 8.5.
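As a rough check on the statement that the posterior mode of $\Lambda$ is near 1.07, the unnormalized posterior of $\Lambda$ for these data can be evaluated on a grid. The Python sketch below is an illustration under stated assumptions, not the method used in the example: it assumes the priors $\Sigma^2 \sim \Gamma^{-1}(1/2, 10/2)$ and $T^2 \sim \Gamma^{-1}(1/2, 1/2)$, so that, given $\Lambda = \lambda$, $\Sigma^2$ is inverse gamma with shape $(1 + 1)/2$ and scale $(10 + \lambda)/2$; it drops factors that do not depend on $\lambda$; and it locates the mode by plain grid search. The names `log_post_lambda` and `grid` are hypothetical.

```python
import math

n = [10, 12, 15]
xbar = [27.9268, 18.1622, 19.5475]
s2 = [23.8227, 57.6736, 32.3858]
psi0, zeta0 = 10.0, 0.1
a0, b0, c0, d0 = 1.0, 10.0, 1.0, 1.0   # prior hyperparameters of the example


def log_post_lambda(lam):
    """Log of the unnormalized posterior density of Lambda at lam > 0."""
    gamma = [ni * lam / (ni + lam) for ni in n]
    g0 = lam * zeta0
    gs = sum(gamma)
    u = sum(g * xb for g, xb in zip(gamma, xbar)) / gs
    # Conditional on lam, Sigma^2 is inverse gamma with parameters
    # (a0 + c0)/2 and (b0 + lam*d0)/2; update them with the data.
    a1 = a0 + c0 + sum(n)
    b1 = (b0 + lam * d0
          + sum((ni - 1) * si for ni, si in zip(n, s2))
          + sum(g * (xb - u) ** 2 for g, xb in zip(gamma, xbar))
          + g0 * gs / (g0 + gs) * (u - psi0) ** 2)
    # Prior density of Lambda (a scaled F distribution), up to a constant.
    log_prior = (c0 / 2 - 1) * math.log(lam) \
        - (a0 + c0) / 2 * math.log(b0 + lam * d0)
    # Marginal likelihood of the data given lam, up to a constant.
    log_lik = ((a0 + c0) / 2 * math.log(b0 + lam * d0)
               + 0.5 * sum(math.log(g) for g in gamma)
               + 0.5 * math.log(g0 / (g0 + gs))
               - a1 / 2 * math.log(b1))
    return log_prior + log_lik


grid = [0.01 * i for i in range(1, 2001)]   # lam from 0.01 to 20
mode = max(grid, key=log_post_lambda)
```

A grid search is the crudest possible choice; it is used here only because it makes the shape of the posterior easy to inspect. Normalizing the same function numerically would give the probabilities needed for statements such as the one about $\Lambda \lt 10$.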



Figure 8.15. Numerical Approximations to Posterior Densities

One could generalize this model to the case in which the variance of $X_{ij}$ conditional on the parameters is $\Sigma_i^2$. In this case, one can only obtain closed-form posteriors conditional on all of the variance parameters, $\Sigma_1^2, \ldots, \Sigma_k^2, T^2$. Numerical integration over all $k + 1$ variance parameters would then be needed. We postpone illustration of this until Section 8.5, at which time we introduce an alternative method of solution that is better suited to this type of problem.

8.2.2 Two-Way Mixed Model ANOVA*

*This section may be skipped without interrupting the flow of ideas.

In this section, we examine a two-way analysis of variance with one random effect and one fixed effect and equal numbers of observations per cell. The recommended analysis of this model will be described in Section 8.5. The analysis given here is mainly motivational as well as illustrative.

Suppose that

$$Y_{i,j,k} = M + A_i + B_j + (AB)_{i,j} + \epsilon_{i,j,k}, \qquad (8.16)$$

where $A$ stands for the random effect and $B$ stands for the fixed effect and

$$\sum_{j=1}^b B_j = 0 = \sum_{j=1}^b (AB)_{i,j}, \quad \text{for all } i, \qquad (8.17)$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$, and $k = 1, \ldots, m$. We suppose that the $\epsilon_{i,j,k}$ are conditionally IID $N(0, \sigma_e^2)$ given $\Sigma_e = \sigma_e$ (and all other parameters). It is also traditional to assume that, conditional on other parameters, the $A_i$ are independent of each other and of the $B_j$ and the $(AB)_{i,j}$, that the $B_j$ are independent of the $(AB)_{i,j}$, and that the $(AB)_{i,j}$ for different $i$ are independent of each other. We can let

$$M_{i,j} = M + A_i + B_j + (AB)_{i,j},$$

and put these into vectors $M_i = (M_{i,1}, \ldots, M_{i,b})^\top$. We can then express the model described above by saying that the $M_i$ are conditionally independent $N_b(\theta, \sigma)$ vectors given $\Theta = \theta$ and $\Sigma = \sigma$. Here $\Sigma$ is a $b \times b$ matrix and $\Theta = (M + B_1, \ldots, M + B_b)^\top$. In order to ensure that (8.17) is reflected in the conditional distribution of $M_i$ given $M$, we assume that $\Sigma$ has the form

$$\Sigma = \Sigma_A^2\mathbf{1}\mathbf{1}^\top + \Sigma_{AB}^2\left[I - \frac{1}{b}\mathbf{1}\mathbf{1}^\top\right],$$

where $\mathbf{1}$ is $b$-dimensional. At the next stage of the hierarchy, assume that the coordinates of $\Theta$, namely $\Theta_1, \ldots, \Theta_b$, are conditionally IID $N(\mu, \sigma_B^2)$ given $M = \mu$ and $\Sigma_B = \sigma_B$ (and other parameters), since $\Theta_j = M + B_j$ in the notation of (8.16). At the next stage, model $M \sim N(\mu_0, \sigma_B^2/\tau)$ given $\Sigma_B = \sigma_B$ (and other parameters). Finally, $\Sigma_e$, $\Sigma_A$, $\Sigma_{AB}$, and $\Sigma_B$ have some joint distribution. In summary, we have Table 8.18.

One way to proceed, after collecting data, would be to march through the levels of the hierarchy, finding all of the posterior distributions. This is done in much the same way as in the simpler model of Section 8.2.1, but with an extra level in the hierarchy. Alternatively, we could take an approach that is typically done in the classical analysis of this model. That approach is to pretend that some of the parameters are not of interest and integrate them out of the model. In particular, the $M_i$ and $\Theta_j$ are usually integrated out of the classical analysis. This is easy to do in the model of Table 8.18.

Table 8.18. Hierarchical Model for Two-Way ANOVA

Stage                 Random Variables                       Distribution                                                                                      Conditional on
Data                  $Y_{i,j,k}$, for all $i, j, k$         independent $N(\mu_{i,j}, \sigma_e^2)$                                                             all $M_{i,j} = \mu_{i,j}$, $\Sigma_e = \sigma_e$, $M = \mu$, $\Theta = \theta$, $\Sigma_A = \sigma_A$, $\Sigma_B = \sigma_B$, $\Sigma_{AB} = \sigma_{AB}$
Parameter             $M_i$ for all $i$                      independent $N_b(\theta, \sigma_A^2\mathbf{1}\mathbf{1}^\top + \sigma_{AB}^2[I - \frac{1}{b}\mathbf{1}\mathbf{1}^\top])$   $M = \mu$, $\Sigma_e = \sigma_e$, $\Sigma_A = \sigma_A$, $\Theta = \theta$, $\Sigma_B = \sigma_B$, $\Sigma_{AB} = \sigma_{AB}$
Hyperparameter        $\Theta_j$ for all $j$                 independent $N(\mu, \sigma_B^2)$                                                                  $\Sigma_A = \sigma_A$, $M = \mu$, $\Sigma_{AB} = \sigma_{AB}$, $\Sigma_B = \sigma_B$, $\Sigma_e = \sigma_e$
Hyperhyperparameter   $M$                                    $N(\mu_0, \sigma_B^2\tau^{-1})$                                                                   $\Sigma_A = \sigma_A$, $\Sigma_e = \sigma_e$, $\Sigma_B = \sigma_B$, $\Sigma_{AB} = \sigma_{AB}$
Variance              $\Sigma_e, \Sigma_B, \Sigma_A, \Sigma_{AB}$   Whatever

To do this, we first note that sufficient statistics are

$$RSS = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^m (Y_{i,j,k} - \bar{Y}_{i,j})^2, \qquad \bar{Y}_{i,j} = \frac{1}{m}\sum_{k=1}^m Y_{i,j,k},$$

with $\bar{Y}_i = (\bar{Y}_{i,1}, \ldots, \bar{Y}_{i,b})^\top$. The distribution of the $\bar{Y}_i$ is that of independent $b$-variate normals, with distribution $N_b(\mu_i, \frac{\sigma_e^2}{m}I)$. To integrate the $M_i$ out of the model conditional on everything else, we note that the distribution of $RSS$ depends only on $\Sigma_e$, so only the distribution of the $\bar{Y}_i$ changes to

$$N_b\left(\theta,\ \frac{\sigma_e^2}{m}I + \sigma_A^2\mathbf{1}\mathbf{1}^\top + \sigma_{AB}^2\left[I - \frac{1}{b}\mathbf{1}\mathbf{1}^\top\right]\right), \qquad (8.19)$$



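and they are still conditionally independent. This means that we can reduce the sufficient statistic even further. We will still need $RSS$, and of course we will need $\bar{Y}_\cdot = \frac{1}{a}\sum_{i=1}^a \bar{Y}_i$. Because of the special form of the covariance matrix of the $\bar{Y}_i$, we do not need the whole matrix $\sum_{i=1}^a (\bar{Y}_i - \bar{Y}_\cdot)(\bar{Y}_i - \bar{Y}_\cdot)^\top$, which would be required if the covariance matrix were unconstrained. Instead, one can use the fact that

$$\left(\frac{\sigma_e^2}{m}I + \sigma_A^2\mathbf{1}\mathbf{1}^\top + \sigma_{AB}^2\left[I - \frac{1}{b}\mathbf{1}\mathbf{1}^\top\right]\right)^{-1} = \frac{m}{\sigma_e^2 + m\sigma_{AB}^2}\left[I - \frac{1}{b}\mathbf{1}\mathbf{1}^\top\right] + \frac{m}{b\,(\sigma_e^2 + bm\sigma_A^2)}\mathbf{1}\mathbf{1}^\top \qquad (8.20)$$

to write the conditional density of the $\bar{Y}_i$ in terms of $\bar{Y}_\cdot$ and

$$SSA = \sum_{i=1}^a bm(\bar{Y}_{i,\cdot} - \bar{Y}_{\cdot,\cdot})^2,$$

$$SSI = \sum_{i=1}^a \sum_{j=1}^b m(\bar{Y}_{i,j} - \bar{Y}_{i,\cdot} - \bar{Y}_{\cdot,j} + \bar{Y}_{\cdot,\cdot})^2 = \sum_{i=1}^a m(\bar{Y}_i - \bar{Y}_\cdot)^\top(\bar{Y}_i - \bar{Y}_\cdot) - SSA,$$

where $\bar{Y}_{i,\cdot}$ is the average of the coordinates of $\bar{Y}_i$ and $\bar{Y}_{\cdot,\cdot}$ is the average of the coordinates of $\bar{Y}_\cdot$. These two sums of squares are the usual sums of squares for the random effect and for interaction, respectively. The conditional distribution of $\bar{Y}_\cdot$ given the parameters is the same as (8.19) except that the covariance matrix must be divided by $a$ because we averaged $a$ independent vectors with the same distribution.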

To integrate $\Theta$ out of the distribution, we note that the distributions of $SSA$ and $SSI$ depend only on $\Sigma_e$, $\Sigma_A$, and $\Sigma_{AB}$. Using (8.20) once again, we can write the conditional density of $\bar{Y}_\cdot$ in terms of $\bar{Y}_{\cdot,\cdot}$ and $SSB = \sum_{j=1}^b ma(\bar{Y}_{\cdot,j} - \bar{Y}_{\cdot,\cdot})^2$, where $\bar{Y}_{\cdot,j}$ is coordinate $j$ of $\bar{Y}_\cdot$, and $SSB$ is the usual sum of squares for the fixed effect. In summary, the sufficient statistics for the model that involves only the parameters $\Sigma_e$, $\Sigma_A$, $\Sigma_{AB}$, $\Sigma_B$, and $M$ are $RSS$, $SSA$, $SSI$, $SSB$, and $\bar{Y}_{\cdot,\cdot}$. The conditional distributions of these quantities given the parameters are easily calculated using the fact that they are functions of orthogonal transformations of the original data. Hence, they are all conditionally independent given the parameters, and their distributions are

$$RSS \sim \sigma_e^2\chi^2_{ab[m-1]}, \qquad SSA \sim (\sigma_e^2 + bm\sigma_A^2)\chi^2_{a-1},$$

$$SSB \sim (\sigma_e^2 + m\sigma_{AB}^2 + am\sigma_B^2)\chi^2_{b-1}, \qquad SSI \sim (\sigma_e^2 + m\sigma_{AB}^2)\chi^2_{[a-1][b-1]}.$$

At this point, the classical analysis differs from any further Bayesian analysis. The classical analysis usually ignores $\bar{Y}_{\cdot,\cdot}$ and makes inference based on the sums of squares. Since the distribution of $\bar{Y}_{\cdot,\cdot}$ depends on some of the variance parameters, a Bayesian would still make use of it, even if interest was solely in the variance parameters. In particular, we could integrate $M$ out of the problem and see that the conditional distribution of $\bar{Y}_{\cdot,\cdot}$ given the variance parameters alone is

$$N\left(\mu_0,\ \frac{\sigma_e^2}{bam} + \frac{\sigma_A^2}{a} + \frac{\sigma_B^2}{b} + \frac{\sigma_B^2}{\tau}\right).$$
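The four sums of squares used in this section can be computed directly from a balanced $a \times b \times m$ array. The Python sketch below is an illustration only: the function name `sums_of_squares` and the small data array are hypothetical. It also exposes the cell means so that the identity $SSI = \sum_i m(\bar{Y}_i - \bar{Y}_\cdot)^\top(\bar{Y}_i - \bar{Y}_\cdot) - SSA$ can be checked.

```python
def sums_of_squares(y):
    """Sums of squares for a balanced two-way layout, y[i][j][k]."""
    a, b, m = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / m for j in range(b)] for i in range(a)]     # Ybar_{i,j}
    row = [sum(cell[i]) / b for i in range(a)]                          # Ybar_{i,.}
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]     # Ybar_{.,j}
    grand = sum(row) / a                                                # Ybar_{.,.}
    rss = sum((y[i][j][k] - cell[i][j]) ** 2
              for i in range(a) for j in range(b) for k in range(m))
    ssa = sum(b * m * (row[i] - grand) ** 2 for i in range(a))
    ssb = sum(a * m * (col[j] - grand) ** 2 for j in range(b))
    ssi = sum(m * (cell[i][j] - row[i] - col[j] + grand) ** 2
              for i in range(a) for j in range(b))
    return {"RSS": rss, "SSA": ssa, "SSB": ssb, "SSI": ssi,
            "grand": grand, "cell": cell, "col": col}


# Hypothetical data with a = 3 levels of A, b = 2 levels of B, m = 2 replicates.
y = [[[4.1, 3.9], [5.2, 5.0]],
     [[6.3, 5.7], [7.1, 6.9]],
     [[2.0, 2.4], [3.3, 3.1]]]
ss = sums_of_squares(y)

total = sum((v - ss["grand"]) ** 2 for plane in y for cellvals in plane for v in cellvals)
# sum_i m (Ybar_i - Ybar_.)^T (Ybar_i - Ybar_.), computed from the cell means.
between_vectors = sum(2 * (ss["cell"][i][j] - ss["col"][j]) ** 2
                      for i in range(3) for j in range(2))
```

For balanced data the total sum of squares about the grand mean splits exactly into the four pieces, which is the numerical counterpart of the orthogonality argument in the text.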

8.2.3 Hypothesis Testing 

On page 384, we found a UMPUI⁶ test for the hypothesis of equal means in a one-way analysis of variance. This test was the usual $F$-test. In Section 4.5.6, we illustrated how the usual $F$-test was a Bayes rule in a decision problem. This decision problem had the property that the prior probability was positive that all the means were equal. It may be that we do not feel that exact equality between the means has positive probability, but we are still interested in how far apart they are. In fixed-effects models, there is a straightforward way to measure the differences between the means which resembles the $F$-test but uses a prior in which the probability is zero that any two means are equal.

⁶ Uniformly most powerful unbiased invariant.

Suppose that $X$ has $N_n(g\beta, \sigma^2 v)$ distribution given $(B, \Sigma) = (\beta, \sigma)$, where $g$ and $v$ are known matrices with $g$ being $n \times p$ and $v$ being $n \times n$. Let the prior distribution be that $B \sim N_p(\beta_0, \sigma^2 w_0^{-1})$ given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$, where $w_0$ is a known, nonsingular $p \times p$ matrix. The sufficient statistics from a sample $X = x$ are

$$\hat\beta = (g^\top v^{-1}g)^{-1}g^\top v^{-1}x, \qquad RSS = (x - g\hat\beta)^\top v^{-1}(x - g\hat\beta).$$

The posterior distribution given $X = x$ has the form $B \sim N_p(\beta_1, \sigma^2 w_1^{-1})$ given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_1/2, b_1/2)$, where

$$a_1 = a_0 + n, \qquad w_1 = w_0 + g^\top v^{-1}g,$$

$$\beta_1 = w_1^{-1}\left(w_0\beta_0 + g^\top v^{-1}g\,\hat\beta\right),$$

$$b_1 = b_0 + RSS + (\beta_0 - \hat\beta)^\top w_0 w_1^{-1}g^\top v^{-1}g\,(\beta_0 - \hat\beta).$$

Let $\mathcal{B}_0 = \{\beta : a\beta = \psi_0\}$, where $a$ is a $q \times p$ matrix of rank $q \lt p$ and $\psi_0$ is some $q$-dimensional vector in the column space of $a$. Suppose that we are interested in how far $B$ is from $\mathcal{B}_0$. Let $\beta_0 \in \mathcal{B}_0$, and define $H = (B - \beta_0)^\top h(B - \beta_0)$, where $h = a^\top(aw_1^{-1}a^\top)^{-1}a$. Note that

$$H = (aB - \psi_0)^\top(aw_1^{-1}a^\top)^{-1}(aB - \psi_0)$$

is the same no matter which $\beta_0 \in \mathcal{B}_0$ is used in its definition. There are two natural ways to measure the distance between $B$ and $\mathcal{B}_0$. One is by $\rho(B, \mathcal{B}_0) = H/\mathrm{trace}(h)$, and the other is by $\rho(B, \mathcal{B}_0)/\Sigma^2$. The reason for the $\mathrm{trace}(h)$ in the denominator is two-fold. First, there is a sense in which $h$ measures the precision of that part of $B$ that lies in $\mathcal{B}_0$. Because of this, there are two factors that contribute to $(B - \beta_0)^\top h(B - \beta_0)$ being large. One is how far $B$ is from $\mathcal{B}_0$, and the other is how precisely we know $B$. Only the former should contribute to the distance between $B$ and $\mathcal{B}_0$. The latter can be used to judge how well we know the distance between $B$ and $\mathcal{B}_0$, but not to increase the actual distance. This means that we must adjust $H$ somehow to remove the effect of the precision. The trace of $h$ is a natural way to do that. The second reason for the trace of $h$ is that it is invariant under alternative representations for the set $\mathcal{B}_0$. That is, if $\mathcal{B}_0 = \{\beta : c\beta = \psi_0\}$ also, then $\mathrm{trace}(c^\top(cw_1^{-1}c^\top)^{-1}c) = \mathrm{trace}(h)$.⁷

To express our uncertainty about the distance between $B$ and $\mathcal{B}_0$, we need the distribution of $H$ and/or the distribution of $H/\Sigma^2$.
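Theorem 8.21. Suppose that, conditional on $Y = y$, the distribution of $Z$ is $NC\chi^2_q(y)$. Suppose also that $Y$ has $\Gamma(a_1/2, b_1/[2c])$ distribution, then the marginal distribution of $Z$ is $ANC\chi^2(q, a_1, \gamma)$, where $\gamma = c/(c + b_1)$. (See page 668.) The mean of $Z$ is $E(Z) = q + ca_1/b_1 = q + a_1\gamma/(1 - \gamma)$.

Proof. The conditional density of $Z$ given $Y = y$ is

$$f_{Z|Y}(z \mid y) = \sum_{i=0}^\infty \frac{(y/2)^i}{i!}\exp\left(-\frac{y}{2}\right)\frac{z^{\frac{q}{2}+i-1}}{2^{\frac{q}{2}+i}\,\Gamma\left(\frac{q}{2}+i\right)}\exp\left(-\frac{z}{2}\right).$$

The marginal density of $Y$ is

$$f_Y(y) = \frac{\left(\frac{b_1}{2c}\right)^{a_1/2}}{\Gamma\left(\frac{a_1}{2}\right)}\,y^{\frac{a_1}{2}-1}\exp\left(-\frac{b_1 y}{2c}\right).$$

The joint density is

$$f_{Z,Y}(z, y) = \sum_{i=0}^\infty \frac{\left(\frac{b_1}{2c}\right)^{a_1/2}}{i!\,2^{\frac{q}{2}+2i}\,\Gamma\left(\frac{q}{2}+i\right)\Gamma\left(\frac{a_1}{2}\right)}\,y^{\frac{a_1}{2}+i-1}\exp\left(-\frac{y}{2}\left[\frac{b_1}{c}+1\right]\right)z^{\frac{q}{2}+i-1}\exp\left(-\frac{z}{2}\right).$$

Integrating out $y$ gives the marginal density of $Z$:

$$f_Z(z) = \sum_{i=0}^\infty \frac{\Gamma\left(\frac{a_1}{2}+i\right)}{i!\,\Gamma\left(\frac{a_1}{2}\right)}\,\gamma^i(1-\gamma)^{a_1/2}\,\frac{z^{\frac{q}{2}+i-1}}{2^{\frac{q}{2}+i}\,\Gamma\left(\frac{q}{2}+i\right)}\exp\left(-\frac{z}{2}\right).$$

Use the formula for $\gamma$ to complete the proof that the distribution is $ANC\chi^2$. The mean of $Z$ given $Y = y$ is $q + y$, and the mean of $Y$ is $ca_1/b_1$. □

⁷ For those with a background in multivariate analysis, it is possible to show that $H/\mathrm{trace}(h)$ is the weighted average of the squares of the principal components of the projection of $w_1^{1/2}(B - \beta_0)$ into the space $w_1^{1/2}\mathcal{B}_0$.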

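Theorem 8.22. Suppose that the conditional distribution of $ZY$ given $Y = y$ is noncentral $\chi^2$, $NC\chi^2_q(cy)$. Also, suppose that the distribution of $Y$ is $\Gamma(a_1/2, b_1/2)$, then the marginal distribution of $a_1Z/[q(b_1 + c)]$ is $ANCF(q, a_1, \gamma)$, where $\gamma = c/(c + b_1)$. (See page 669.) If $a_1 \gt 2$, the mean of $Z$ is $E(Z) = c + qb_1/(a_1 - 2)$.

Proof. The conditional density of $Z$ given $Y = y$ is

$$f_{Z|Y}(z \mid y) = \sum_{i=0}^\infty \frac{(cy/2)^i}{i!}\exp\left(-\frac{cy}{2}\right)\frac{y\,(zy)^{\frac{q}{2}+i-1}}{2^{\frac{q}{2}+i}\,\Gamma\left(\frac{q}{2}+i\right)}\exp\left(-\frac{zy}{2}\right).$$

The marginal density of $Y$ is

$$f_Y(y) = \frac{\left(\frac{b_1}{2}\right)^{a_1/2}}{\Gamma\left(\frac{a_1}{2}\right)}\,y^{\frac{a_1}{2}-1}\exp\left(-\frac{b_1 y}{2}\right).$$

The joint density $f_{Z,Y}(z, y)$ is equal to $\exp(-y[b_1 + c + z]/2)$ times

$$\sum_{i=0}^\infty \frac{c^i\,b_1^{a_1/2}\,z^{\frac{q}{2}+i-1}\,y^{\frac{a_1}{2}+\frac{q}{2}+2i-1}}{i!\,2^{\frac{q}{2}+\frac{a_1}{2}+2i}\,\Gamma\left(\frac{q}{2}+i\right)\Gamma\left(\frac{a_1}{2}\right)}.$$

Integrating $y$ out of this gives the density of $Z$:

$$f_Z(z) = \sum_{i=0}^\infty \frac{\Gamma\left(\frac{a_1}{2}+\frac{q}{2}+2i\right)c^i\,b_1^{a_1/2}\,z^{\frac{q}{2}+i-1}}{i!\,\Gamma\left(\frac{q}{2}+i\right)\Gamma\left(\frac{a_1}{2}\right)(b_1+c+z)^{\frac{a_1}{2}+\frac{q}{2}+2i}}.$$

Now, make the change of variables from $z$ to $u = z/(z + b_1 + c)$. The inverse is $z = (b_1 + c)u/(1 - u)$. The derivative is $(b_1 + c)/(1 - u)^2$. The density of $U = Z/(Z + b_1 + c)$ is

$$\sum_{i=0}^\infty \frac{\Gamma\left(\frac{a_1}{2}+\frac{q}{2}+2i\right)c^i\,b_1^{a_1/2}}{(b_1+c)^{\frac{a_1}{2}+i}\,i!\,\Gamma\left(\frac{q}{2}+i\right)\Gamma\left(\frac{a_1}{2}\right)}\,u^{\frac{q}{2}+i-1}(1-u)^{\frac{a_1}{2}+i-1}.$$

Setting $\gamma = c/(c + b_1)$ and rearranging $\Gamma$ function values produces the $ANCB(q, a_1, \gamma)$ density. We know that

$$\frac{a_1 Z}{q(b_1 + c)} = \frac{a_1}{q}\,\frac{U}{1 - U},$$

which must have $ANCF(q, a_1, \gamma)$ distribution. The mean is obtained by noting that $E(Z \mid Y = y) = c + q/y$ and $E(1/Y) = b_1/(a_1 - 2)$ if $a_1 \gt 2$. □

In Theorem 8.21, let $Z = H/\Sigma^2$. Since $aB - \psi_0$ has multivariate normal distribution $N_q(a\beta_1 - \psi_0, \sigma^2 aw_1^{-1}a^\top)$ given $\Sigma = \sigma$, it follows that $Z$ has noncentral $\chi^2$ distribution with $q$ degrees of freedom and noncentrality parameter $y = (a\beta_1 - \psi_0)^\top(aw_1^{-1}a^\top)^{-1}(a\beta_1 - \psi_0)/\sigma^2$ given $\Sigma = \sigma$. Since

$$c = (a\beta_1 - \psi_0)^\top(aw_1^{-1}a^\top)^{-1}(a\beta_1 - \psi_0) \qquad (8.23)$$

is a constant in the posterior distribution, we can let $Y = c/\Sigma^2$, which has $\Gamma(a_1/2, b_1/[2c])$ distribution. It follows that the distribution of $H/\Sigma^2$ is $ANC\chi^2(q, a_1, \gamma)$.

In Theorem 8.22, let $Z = H$ and $Y = 1/\Sigma^2$. Now, $ZY$ has noncentral $\chi^2$ distribution with $q$ degrees of freedom and noncentrality parameter $cy$ given $Y = y$. Also, $Y \sim \Gamma(a_1/2, b_1/2)$. It follows that the distribution of $a_1H/[q(b_1 + c)]$ is $ANCF(q, a_1, \gamma)$.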

Example 8.24. We will use the same data as in Example 8.14 on page 487, but we will use a conjugate prior for the parameters $\Sigma$ and $B = (M_1, M_2, M_3)^\top$. The design matrix $g$ is particularly simple and $v$ is the identity matrix. We get $g^\top v^{-1}g$ to be the $3 \times 3$ diagonal matrix with 10, 12, and 15 on the diagonal. Suppose that the prior has hyperparameters

$$a_0 = 1, \qquad b_0 = 10,$$

$$\beta_0 = \begin{pmatrix} 10 \\ 10 \\ 10 \end{pmatrix}, \qquad w_0 = \begin{pmatrix} 6.7742 & -3.2258 & -3.2258 \\ -3.2258 & 6.7742 & -3.2258 \\ -3.2258 & -3.2258 & 6.7742 \end{pmatrix}.$$

The posterior distribution has hyperparameters

$$a_1 = 38, \qquad b_1 = 1729.7,$$

$$\beta_1 = \begin{pmatrix} 24.44734 \\ 19.43751 \\ 20.11565 \end{pmatrix}, \qquad w_1 = \begin{pmatrix} 16.7742 & -3.2258 & -3.2258 \\ -3.2258 & 18.7742 & -3.2258 \\ -3.2258 & -3.2258 & 21.7742 \end{pmatrix}.$$

Now, suppose that $\mathcal{B}_0 = \{\beta : \beta_1 = \beta_2 = \beta_3\}$. This can be represented by the matrix and vector

$$a = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}, \qquad \psi_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Figure 8.25. CDF of $V = \sqrt{H/[44.3331\,\Sigma^2]}$

The noncentrality parameter of the alternate noncentral distributions can be calculated to equal $\gamma = 307.5124/(1729.7 + 307.5124) = 0.1509$. The trace of $h$ is 44.3331.

If we want to describe our uncertainty about how far apart the $M_i$ are, we could look at the CDF of some function of $H$ or of $H/\Sigma^2$. For example, Figure 8.25 gives the graph of the CDF of $V = \sqrt{H/[44.3331\,\Sigma^2]}$. We see that it is almost certain that the average distance between the $M_i$ is less than $\Sigma$, and there is a 95% chance that the average distance is at least $0.18\Sigma$.
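The posterior hyperparameters and the quantities $c$, $\mathrm{trace}(h)$, and $\gamma$ quoted in Example 8.24 can be reproduced with a few lines of linear algebra. The Python sketch below is only an illustration: the helper names (`solve`, `matvec`, `dot`) are hypothetical, the contrast matrix `contrast` is one assumed representation of the set of equal means (any full-rank set of contrasts gives the same $h$), and the sample means and variances are taken from the data table of Example 8.14.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    size = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, size):
            f = m[r][col] / m[col][col]
            for j in range(col, size + 1):
                m[r][j] -= f * m[col][j]
    z = [0.0] * size
    for r in reversed(range(size)):
        z[r] = (m[r][size] - sum(m[r][j] * z[j] for j in range(r + 1, size))) / m[r][r]
    return z


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


n = [10, 12, 15]
beta_hat = [27.9268, 18.1622, 19.5475]   # group sample means
s2 = [23.8227, 57.6736, 32.3858]         # group sample variances
a0, b0 = 1.0, 10.0
beta0 = [10.0, 10.0, 10.0]
w0 = [[6.7742 if i == j else -3.2258 for j in range(3)] for i in range(3)]

gtg = [[float(n[i]) if i == j else 0.0 for j in range(3)] for i in range(3)]
w1 = [[w0[i][j] + gtg[i][j] for j in range(3)] for i in range(3)]
rhs = [p + q for p, q in zip(matvec(w0, beta0), matvec(gtg, beta_hat))]
beta1 = solve(w1, rhs)

rss = sum((ni - 1) * si for ni, si in zip(n, s2))
d = [p - q for p, q in zip(beta0, beta_hat)]
a1 = a0 + sum(n)
b1 = b0 + rss + dot(d, matvec(w0, solve(w1, matvec(gtg, d))))

contrast = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]   # assumed representation of equal means
cols = [solve(w1, row) for row in contrast]        # columns of w1^{-1} a^T
mid = [[dot(contrast[i], cols[j]) for j in range(2)] for i in range(2)]
det = mid[0][0] * mid[1][1] - mid[0][1] * mid[1][0]
mid_inv = [[mid[1][1] / det, -mid[0][1] / det],
           [-mid[1][0] / det, mid[0][0] / det]]

# h = a^T (a w1^{-1} a^T)^{-1} a, and c from (8.23) with psi_0 = 0.
h = [[sum(contrast[r][i] * mid_inv[r][s] * contrast[s][j]
          for r in range(2) for s in range(2)) for j in range(3)] for i in range(3)]
trace_h = sum(h[i][i] for i in range(3))
ab = matvec(contrast, beta1)
c = dot(ab, matvec(mid_inv, ab))
gamma = c / (c + b1)
```

Because the entries of `w0` are given to four decimals, the computed values agree with the printed ones only up to rounding.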

8.3 Nonnormal Models* 

Hierarchical models are useful for problems in which data have any sort of 
distribution. We will give two examples in this section. 

8.3.1 Poisson Process Data 

Suppose that several stochastic processes are being compared. For example, each process may be registering the occurrence of defects produced by one of several machines. Or, each process may be registering the times at which a criminal is arrested. Suppose that we model the processes as Poisson processes conditional on parameters $\Theta_1, \ldots, \Theta_k$, so that process $i$ has rate $\theta_i$ given $\Theta_i = \theta_i$. We could then model the $\Theta_i$ as a priori exchangeable random variables with $\Gamma(\alpha, \beta)$ distribution given $A = \alpha$ and $B = \beta$. We would then need a distribution for $(A, B)$. Suppose that the data for process $i$ consist of $T_i$ units of time and $N_i$ occurrences. The posterior distributions of the $\Theta_i$s given $A = \alpha$, $B = \beta$, $T_i = t_i$, and $N_i = n_i$ are of independent random variables with $\Theta_i$ having $\Gamma(\alpha + n_i, \beta + t_i)$ distribution. The posterior density of $(A, B)$ is proportional to

$$f_{A,B}(\alpha, \beta)\,\frac{\beta^{k\alpha}}{\Gamma(\alpha)^k}\prod_{i=1}^k \frac{\Gamma(\alpha + n_i)}{(\beta + t_i)^{\alpha + n_i}}. \qquad (8.26)$$

This would require numerical integration or approximations in order to make use of it.

*This section may be skipped without interrupting the flow of ideas.

Example 8.27. Suppose that $N_i$ is the number of times an individual is arrested in $T_i$ units of time (months). We will assume that $T_i$ is independent of the parameters, and that conditional on $T_i = t_i$ and $\Theta_1 = \theta_1, \ldots, \Theta_k = \theta_k$, $A = \alpha$, $B = \beta$, the $N_i$ are independent $Poi(t_i\theta_i)$. The $\Theta_i$ are modeled as IID $\Gamma(\alpha, \beta)$ given $A = \alpha$, $B = \beta$. We will use the following prior distribution for $(A, B)$:

$$A \sim \Gamma\left(a^{(0)}, b^{(0)}\right), \qquad B \mid A = \alpha \sim \Gamma\left(c^{(0)}, \frac{d^{(0)}}{\alpha}\right).$$

In this prior, $B/A$ is independent of $A$. Suppose that we use the prior hyperparameters $a^{(0)} = 1/2$, $b^{(0)} = 1$, $c^{(0)} = 13$, and $d^{(0)} = 1$. The data consist of $k = 6$ individuals with the following observations:

Subject ($i$)     1    2    3    4    5    6
Time ($t_i$)      36   27   14   6    20   30
Number ($n_i$)    2    3    1    1    2    2

We will illustrate two numerical techniques for drawing inferences from this data and model. Suppose that we want the predictive distributions of the numbers of arrests in a future 24-month period for two different individuals. One of them is the second observed individual in the data set, and the other is an individual not in the data set but deemed to be a priori exchangeable with them. Denote these individuals by $i = 2$ and $i = 7$, respectively, and denote the numbers of arrests by $M_2$ and $M_7$ to distinguish them from the observed data. What we seek is $f_{M_i|X}(n \mid x)$ for $i = 2, 7$ and $n = 0, 1, \ldots$, where $X = x$ is the observed data. We can write

$$f_{M_i|X}(n \mid x) = \int\!\!\int f_{M_i|X,A,B}(n \mid x, \alpha, \beta)\,f_{A,B|X}(\alpha, \beta \mid x)\,d\alpha\,d\beta,$$

$$f_{M_i|X,A,B}(n \mid x, \alpha, \beta) = \int f_{M_i|\Theta_i}(n \mid \theta)\,f_{\Theta_i|X,A,B}(\theta \mid x, \alpha, \beta)\,d\theta,$$

$$f_{M_i|\Theta_i}(n \mid \theta) = \exp(-24\theta)\frac{(24\theta)^n}{n!},$$

$$f_{\Theta_2|X,A,B}(\theta \mid x, \alpha, \beta) = \frac{(\beta + 27)^{\alpha+3}}{\Gamma(\alpha + 3)}\,\theta^{\alpha+2}\exp[-\theta(\beta + 27)],$$

$$f_{M_2|X,A,B}(n \mid x, \alpha, \beta) = \frac{24^n(\beta + 27)^{\alpha+3}\,\Gamma(\alpha + 3 + n)}{n!\,\Gamma(\alpha + 3)(\beta + 51)^{\alpha+3+n}},$$

$$f_{M_7|X,A,B}(n \mid x, \alpha, \beta) = \frac{24^n\beta^\alpha\,\Gamma(\alpha + n)}{n!\,\Gamma(\alpha)(\beta + 24)^{\alpha+n}}.$$

Therefore, we need to be able to integrate these last two expressions times the expression in (8.26) renormalized to be a density. The normalization constant is $f_X(x)$, the integral of (8.26) over $\alpha$ and $\beta$.

First, we used Laplace's method from Section 7.4.3, since all the functions being integrated are positive. The "$\Theta$" in this example is $(A, B)$, and $g(\theta)$ is one of the several functions obtained by fixing $n$ in either $f_{M_2|X,A,B}(n \mid x, \alpha, \beta)$ or $f_{M_7|X,A,B}(n \mid x, \alpha, \beta)$ from above. Due to the form of the prior, it seemed sensible to transform to $(A, B/A)$ before applying Laplace's method.

Second, we used importance sampling (see Section B.7) to integrate numerically. We used a single set of 100,000 pseudorandom pairs drawn from the prior distribution of $(A, B)$ to perform all of the integrals. We also calculated variances using the delta method for each ordinate. The results for $M_2$ are shown in Figure 8.28, and those for $M_7$ are shown in Figure 8.29. The standard deviations of the importance sample ordinates were all at least two orders of magnitude smaller than the ordinates themselves. As we can see in the figures, the two methods produce nearly the same results.
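The two conditional predictive mass functions in Example 8.27 are negative binomial probabilities and can be evaluated stably on the log scale. The Python sketch below is a minimal illustration: the function names and the particular pair $(\alpha, \beta) = (1, 13)$ are hypothetical, and no integration over $(A, B)$ is attempted, so these are predictive probabilities conditional on fixed hyperparameter values only.

```python
import math


def log_gamma_poisson_pmf(n, shape, rate, t):
    """log P(M = n) when M | theta ~ Poi(t * theta) and theta ~ Gamma(shape, rate)."""
    return (n * math.log(t) + shape * math.log(rate) + math.lgamma(shape + n)
            - math.lgamma(n + 1) - math.lgamma(shape)
            - (shape + n) * math.log(rate + t))


def predictive_m2(n, alpha, beta):
    # Subject 2 was observed for 27 months with 3 arrests, so Theta_2 is
    # Gamma(alpha + 3, beta + 27) given the data and (alpha, beta).
    return math.exp(log_gamma_poisson_pmf(n, alpha + 3.0, beta + 27.0, 24.0))


def predictive_m7(n, alpha, beta):
    # A new exchangeable subject has Theta_7 ~ Gamma(alpha, beta).
    return math.exp(log_gamma_poisson_pmf(n, alpha, beta, 24.0))


alpha, beta = 1.0, 13.0   # hypothetical hyperparameter values
pmf2 = [predictive_m2(n, alpha, beta) for n in range(400)]
pmf7 = [predictive_m7(n, alpha, beta) for n in range(400)]
mean2 = sum(n * p for n, p in enumerate(pmf2))
mean7 = sum(n * p for n, p in enumerate(pmf7))
```

Averaging these functions over draws of $(\alpha, \beta)$, weighted by (8.26), is what the importance sampling step in the example does for each value of $n$.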



8.3.2 Bernoulli Process Data 

[Figure 8.28. Numerical Approximations to Density of M2 (legend: Laplace, Imp. Sam.)]

[Figure 8.29. Numerical Approximations to Density of M7]

Suppose that we can collect counts from several different sources. For example,
we might be administering several treatments and we count how many recoveries
occur in each treatment group. The data from group $i$ will consist of $n_i$, the
number of subjects in the group, and $X_i$, the number of successes, for
$i = 1, \ldots, k$. We model the successes as Bernoulli processes conditional on
parameters $P_i$, with the probability of success in group $i$ being $p_i$ given
$P_i = p_i$. We can model the $P_i$ as exchangeable random variables with
$\mathrm{Beta}(\theta r, [1-\theta]r)$ distribution conditional on $\Theta = \theta$
and $R = r$. Here, $\Theta$ is like the average probability and $R$ is like a
measure of similarity. The larger $R$ is, the more similar the $P_i$ are. The
posterior distribution of the $P_i$ given $\Theta = \theta$, $R = r$, and
$X_i = x_i$ is that of independent random variables with $P_i$ having
$\mathrm{Beta}(\theta r + x_i, [1-\theta]r + n_i - x_i)$ distribution. The posterior
density of $(\Theta, R)$ would be proportional to

$$f_{\Theta,R}(\theta,r)\,\frac{\Gamma(r)^k}{\Gamma(\theta r)^k\,\Gamma([1-\theta]r)^k}
\prod_{i=1}^k \frac{\Gamma(\theta r + x_i)\,\Gamma([1-\theta]r + n_i - x_i)}{\Gamma(r+n_i)}.$$

This would require numerical integration or approximations in order to
make use of it.

One possible approximation that is available puts this problem into the
normal model framework. If the $n_i$ will be large, we can model
$Y_i = 2\arcsin\sqrt{X_i/n_i}$ as approximately $N(2\arcsin\sqrt{p_i}, 1/n_i)$ random
variables given $P_i = p_i$. We could then use the same transformation on the $P_i$
to model the $M_i = 2\arcsin\sqrt{P_i}$ as approximately $N(\mu, 1/\tau)$ given
$M = \mu$ and $T = \tau$. Then $M$ can be modeled as $N(\mu^{(0)}, 1/(\lambda\tau))$
given $T = \tau$, and $T$ can be given some distribution. Here, $M$ plays the role
of $2\arcsin\sqrt{\Theta}$ and $T$ plays the role of $R$ from the earlier model. The
posterior distribution of the $M_i$ given $M = \mu$ and $T = \tau$ is that of
independent random variables with $M_i$ having $N(\mu_i(\mu), 1/(\tau + n_i))$
distribution, where

$$\mu_i(\mu) = \frac{\mu\tau + n_i y_i}{\tau + n_i}.$$

The posterior of $M$ given $T = \tau$ is $N(\mu^{(1)}(\tau), 1/[\tau\gamma(\tau)])$, where

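$$\gamma(\tau) = \lambda + \sum_{i=1}^k \frac{n_i}{n_i+\tau}, \qquad
\mu^{(1)}(\tau) = \frac{1}{\gamma(\tau)}\left[\lambda\mu^{(0)} + \sum_{i=1}^k \frac{n_i y_i}{n_i+\tau}\right].$$

The posterior for $T$ cannot be given in closed form, but the density is
proportional to $f_T(\tau)$ times

$$\tau^{k/2}\prod_{i=1}^k(\tau+n_i)^{-1/2}\,\gamma(\tau)^{-1/2}
\exp\left\{-\frac{\tau}{2}\left[w(\tau) + \frac{\lambda\Lambda(\tau)}{\gamma(\tau)}\left(\mu^{(0)} - \bar y(\tau)\right)^2\right]\right\},$$

where

$$\Lambda(\tau) = \sum_{i=1}^k \frac{n_i}{n_i+\tau}, \qquad
\bar y(\tau) = \frac{1}{\Lambda(\tau)}\sum_{i=1}^k \frac{n_i y_i}{n_i+\tau}, \qquad
w(\tau) = \sum_{i=1}^k \frac{n_i}{n_i+\tau}\left(y_i - \bar y(\tau)\right)^2.$$

Once again, if $n_i$ is large for each $i$, then $n_i/(n_i + \tau) \approx 1$ and
$n_i + \tau \approx n_i$ for each $i$. It follows that $\Lambda(\tau)$ is approximately
$k$ and that $\bar y(\tau)$ is approximately the average of the $y_i$, say $\bar y$.
Hence, the posterior density of $T$ is approximately proportional to

$$f_T(\tau)\,\tau^{k/2}\exp\left(-\frac{\tau}{2}\left[w + \frac{\lambda k}{\lambda + k}\left(\mu^{(0)} - \bar y\right)^2\right]\right),$$

where $w = \sum_{i=1}^k (y_i - \bar y)^2$. Also, the conditional posterior for $M$
given $T = \tau$ is approximately

$$N\left(\frac{\lambda\mu^{(0)} + \sum_{i=1}^k y_i}{\lambda + k},\ \frac{1}{\tau(\lambda + k)}\right).$$

If $T$ has a $\Gamma(a^{(0)}/2, b^{(0)}/2)$ prior, then the approximate posterior of $T$ is
$\Gamma(a^{(1)}/2, b^{(1)}/2)$, where

$$a^{(1)} = a^{(0)} + k, \qquad
b^{(1)} = b^{(0)} + w + \frac{k\lambda}{k+\lambda}\left(\mu^{(0)} - \bar y\right)^2.$$

Of course, using these same approximations, the conditional distribution
of $M_i$ given $M$ and $T$ would be $N(y_i, 1/n_i)$, which is independent of $M$ and
$T$ anyway.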
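
The large-sample recipe of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are hypothetical, the function names are ours, and the update for $T$ assumes the conjugate form $a^{(1)} = a^{(0)} + k$, $b^{(1)} = b^{(0)} + w + \frac{k\lambda}{k+\lambda}(\mu^{(0)} - \bar y)^2$ with $w = \sum_i (y_i - \bar y)^2$.

```python
import math

def arcsine_transform(x, n):
    """Variance-stabilizing transform y = 2*arcsin(sqrt(x/n)); Var(y) is roughly 1/n."""
    return 2.0 * math.asin(math.sqrt(x / n))

def approx_precision_posterior(y, a0, b0, lam, mu0):
    """Return (a1, b1) for the approximate Gamma(a1/2, b1/2) posterior of T.

    Assumes every n_i is large, so that the n_i drop out of the update.
    """
    k = len(y)
    ybar = sum(y) / k
    w = sum((yi - ybar) ** 2 for yi in y)
    a1 = a0 + k
    b1 = b0 + w + k * lam / (k + lam) * (mu0 - ybar) ** 2
    return a1, b1

# Hypothetical data: successes x_i out of n_i subjects in k = 3 groups.
successes = [30, 45, 60]
subjects = [100, 100, 100]
y = [arcsine_transform(x, n) for x, n in zip(successes, subjects)]
a1, b1 = approx_precision_posterior(y, a0=1.0, b0=1.0, lam=1.0, mu0=math.pi / 2)
```

Note that a success proportion of one half maps to $\pi/2$, which is why that value is used as the hypothetical prior mean $\mu^{(0)}$ here.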




8.4 Empirical Bayes Analysis* 

Classical statisticians try to make use of hierarchical models either by leaving
the hyperparameters at various stages of the hierarchy unspecified or
by not specifying a distribution for the hyperparameters at certain stages.
This allows them to treat these values as "unknown parameters" in much
the same way that they treat parameters in other models. For example,
the hierarchical model in Table 8.9 could be altered by letting $\Psi$, $T$, and
$\Sigma$ be unknown parameters to be estimated without specifying distributions.
In Table 8.18, we could let $M$, $\Sigma^2$, $\Sigma_A^2$, $\Sigma_B^2$, and $\Sigma_{AB}^2$ be the
parameters by integrating the other parameters out the way we did in Section 8.2.2.
A good introduction to empirical Bayes analysis was given by Morris (1983).
Robbins (1951, 1955, 1964) first introduced the term "empirical Bayes" and
the general methodology.

8.4.1 Naive Empirical Bayes 

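The naive approach to empirical Bayes analysis is to estimate the hyperparameters
at some level of the hierarchical model and then pretend as if
these were known a priori and use the resulting posterior distributions for
parameters at lower levels in the hierarchy. For example, in the one-way
ANOVA (see Table 8.9), we could use (8.12) to specify the joint density of
the data given the parameters $\Psi$, $T$, and $\Sigma$. Then we could let
$\Lambda = \Sigma^2/T^2$ so that the likelihood of $\Psi$, $\Sigma^2$, and $\Lambda$ is

$$(2\pi\sigma^2)^{-n/2}\prod_{i=1}^k\left(\frac{\lambda}{n_i+\lambda}\right)^{1/2}
\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^k\left[\frac{n_i\lambda}{n_i+\lambda}(\bar x_i - \psi)^2 + (n_i-1)s_i^2\right]\right\},$$

where $n = \sum_{i=1}^k n_i$. For fixed $\sigma^2$ and $\lambda$, this is maximized over
$\psi$ by choosing

$$\hat\Psi(\lambda) = \frac{\sum_{i=1}^k \frac{n_i\bar x_i}{n_i+\lambda}}{\sum_{i=1}^k \frac{n_i}{n_i+\lambda}}.$$

If we plug this value for $\psi$ into the likelihood and maximize over $\sigma^2$ for
fixed $\lambda$, we get

$$\hat\Sigma^2(\lambda) = \frac{1}{n}\sum_{i=1}^k\left[\frac{n_i\lambda}{n_i+\lambda}\left(\bar x_i - \hat\Psi(\lambda)\right)^2 + (n_i - 1)s_i^2\right].$$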

*This section may be skipped without interrupting the flow of ideas.



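If we plug this value for $\sigma^2$ into the likelihood, we get the following function
of $\lambda$ to maximize:

$$\prod_{i=1}^k\left(\frac{\lambda}{n_i+\lambda}\right)^{1/2}\left[\hat\Sigma^2(\lambda)\right]^{-n/2}. \qquad (8.30)$$

This would produce the MLE of $\Lambda$, call it $\hat\Lambda$. Then set
$\hat\Sigma^2 = \hat\Sigma^2(\hat\Lambda)$ and $\hat\Psi = \hat\Psi(\hat\Lambda)$ to get the overall
MLEs. Then, we can make inference about the $M_i$s by using the conditional
distribution in (8.8).

In the special case in which all $n_i = m$, $\hat\Psi(\lambda) = \sum_{i=1}^k \bar x_i/k$ and is
not a function of $\lambda$. Also $\hat\Sigma^2(\lambda)$ simplifies and the derivative of (8.30)
can actually be set equal to zero to solve for the maximum. If we let
$g = 1/m + 1/\lambda$, then

$$\hat\Sigma^2(\lambda) = (ng)^{-1}\sum_{i=1}^k(\bar x_i - \hat\Psi)^2 + \frac{m-1}{n}\sum_{i=1}^k s_i^2$$

and the derivative of the log of (8.30) with respect to $g$ becomes

$$-\frac{k}{2g} + \frac{\sum_{i=1}^k(\bar x_i - \hat\Psi)^2}{2\hat\Sigma^2(\lambda)\,g^2}. \qquad (8.31)$$

Setting this equal to 0 gives $g = \sum_{i=1}^k(\bar x_i - \hat\Psi)^2/[k\hat\Sigma^2(\lambda)]$.
Solving for $g$ yields $g$ to equal a multiple of the usual $F$ statistic for testing
the hypothesis of no difference between groups:

$$g = \frac{(n-k)\sum_{i=1}^k(\bar x_i - \hat\Psi)^2}{k(m-1)\sum_{i=1}^k s_i^2} = \frac{k-1}{km}F.$$

Of course, $g \ge 1/m$ is required. If $F \le k/(k-1)$, the derivative in (8.31)
is negative at $g = 1/m$, so the maximum occurs at $g = 1/m$. Hence the
MLE of $\Lambda$ is

$$\hat\Lambda = \begin{cases} \dfrac{km}{(k-1)F - k} & \text{if } F > k/(k-1), \\ \infty & \text{otherwise.} \end{cases}$$

This means that $\hat T^2 = 0$ if $F \le k/(k-1)$.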

Example 8.32 (Continuation of Example 8.14; see page 487). Using the data
in this example, we can calculate the likelihood function for $\Lambda$ and maximize it
to obtain $\hat\Lambda = 2.614$. The other MLEs are $\hat\Psi = 21.78441$ and
$\hat\Sigma^2 = 38.37032$. This makes $\hat T^2 = 14.67878$. Now, we could use (8.8) to say that
the approximate distribution of $M_i$ is

$$N\left(\frac{\hat\Lambda\hat\Psi + n_i\bar x_i}{\hat\Lambda + n_i},\ \frac{\hat\Sigma^2}{n_i + \hat\Lambda}\right).$$

For the three groups, these distributions are respectively

$N(26.6539, 3.04188)$, $N(18.8101, 2.62559)$, $N(19.8795, 2.17840)$.
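
As a check on the arithmetic, the naive empirical Bayes variances $\hat\Sigma^2/(n_i + \hat\Lambda)$ can be recomputed from the reported estimates. The short Python sketch below uses the group sizes 10, 12, and 15 that appear in Example 8.35; the function names are ours, and the group means $\bar x_i$ are not reproduced here, so the mean function is shown but not evaluated.

```python
def naive_eb_variances(sigma2_hat, lambda_hat, group_sizes):
    """Naive empirical Bayes posterior variances sigma2_hat / (n_i + lambda_hat)."""
    return [sigma2_hat / (n + lambda_hat) for n in group_sizes]

def naive_eb_mean(xbar_i, n_i, psi_hat, lambda_hat):
    """Shrinkage estimate (lambda*psi + n_i*xbar_i) / (lambda + n_i)."""
    return (lambda_hat * psi_hat + n_i * xbar_i) / (lambda_hat + n_i)

variances = naive_eb_variances(38.37032, 2.614, [10, 12, 15])
```

The three entries of `variances` agree with the variances 3.04188, 2.62559, and 2.17840 quoted above to the precision shown.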






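Example 8.33 (Continuation of Example 7.60; see page 420). In this example,
each observation is a pair $(X_i, Y_i)$ that are conditionally IID $N(\mu_i, \sigma^2)$ and
the pairs are conditionally independent given $(\Sigma, M_1, M_2, \ldots)$. Suppose that we
model $(M_1, M_2, \ldots)$ as conditionally IID $N(\mu, \sigma^2/\lambda)$ given
$(M, \Lambda) = (\mu, \lambda)$. The empirical Bayes approach might treat $(\Sigma, M, \Lambda)$ as
the parameter to be estimated by maximum likelihood. The likelihood function for
these parameters is

$$(2\pi\sigma^2)^{-n}\left(\frac{\lambda}{\lambda+2}\right)^{n/2}
\exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^n\frac{(x_i-y_i)^2}{2}
+ \frac{2\lambda}{\lambda+2}\sum_{i=1}^n\left(\frac{x_i+y_i}{2} - \frac{\bar x+\bar y}{2}\right)^2
+ \frac{2\lambda n}{\lambda+2}\left(\mu - \frac{\bar x+\bar y}{2}\right)^2\right]\right).$$

The MLE for $M$ is $\hat M = (\bar X + \bar Y)/2$. The MLE for $\Sigma^2$ as a function of
$\lambda$ is

$$\hat\Sigma^2(\lambda) = \frac{1}{2n}\left[\sum_{i=1}^n\frac{(X_i-Y_i)^2}{2}
+ \frac{2\lambda}{\lambda+2}\sum_{i=1}^n\left(\frac{X_i+Y_i}{2} - \frac{\bar X+\bar Y}{2}\right)^2\right].$$

The MLE of $\Lambda$ can be found from

$$\frac{\hat\Lambda}{\hat\Lambda+2} = \min\left\{1,\ \frac{\sum_{i=1}^n(X_i-Y_i)^2}
{4\sum_{i=1}^n\left(\frac{X_i+Y_i}{2} - \frac{\bar X+\bar Y}{2}\right)^2}\right\}.$$

Since the "observations" $(X_i+Y_i)/2$ are conditionally IID
$N(\mu, \sigma^2[(1/2) + (1/\lambda)])$ given the parameters, it follows that

$$\frac{1}{n}\sum_{i=1}^n\left(\frac{X_i+Y_i}{2} - \frac{\bar X+\bar Y}{2}\right)^2
\ \stackrel{P}{\to}\ \frac{(\lambda+2)\sigma^2}{2\lambda}.$$

This implies that $\hat\Lambda$ is consistent and, in turn, that $\hat\Sigma^2(\hat\Lambda)$ is
consistent. The extra terms added due to the empirical Bayes analysis make
$\hat\Sigma^2(\hat\Lambda)$ consistent (relative to the empirical Bayes model).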

It is not required that one use maximum likelihood estimates in a naive 
empirical Bayes analysis. For example, in the one-way ANOVA example, 
we could use 



, V* . ti.,r?r 

T 2 = max { 0, 



i=i * =1 



1 V** „2 

» - 5 E«=i »i 
which are based on unbiased estimators. 






8.4.2 Adjusted Empirical Bayes 

It is generally recognized that naive empirical Bayes analyses underestimate
the variances of parameters because they do not take into account the
fact that estimated hyperparameters were not really known a priori. For
example, in the empirical Bayes version of Example 8.14, we treat $\hat\Psi$ as if
it were $\Psi$ and were known a priori. To reflect the fact that we really do not
know $\Psi$ a priori, the posterior variance of $M_i$ should be increased by

$$\frac{\hat\Sigma^4}{(n_i\hat T^2 + \hat\Sigma^2)^2}\,\mathrm{Var}(\hat\Psi)
= \frac{\hat\Lambda^2}{(n_i + \hat\Lambda)^2}\,\mathrm{Var}(\hat\Psi).$$

We would already have an estimate of $\Lambda$ from the naive analysis. We could
use (8.11) with $\zeta_0 = 0$ and estimate $\mathrm{Var}(\hat\Psi)$ by

$$\left(\sum_{i=1}^k \frac{n_i}{\hat\Sigma^2 + n_i\hat T^2}\right)^{-1}.$$

The value of this estimate would depend on how we estimated $T$ and $\Sigma$, of
course. We should also increase the variance of $M_i$ to reflect the fact that
$\Sigma$ and $T$ were estimated. An easy way to do this is to replace the normal
distribution in the posterior by a $t$ distribution with appropriate degrees
of freedom. Morris (1983) chooses, instead, to replace the naive variance
expression $\hat\Sigma^2/(n_i + \hat\Lambda)$ by

$$\frac{\hat\Sigma^2}{n_i}\left(1 - \frac{k-1}{k}\,\frac{\hat\Lambda}{\hat\Lambda + n_i}\right). \qquad (8.34)$$

This amounts to estimating the shrinkage factor $\Lambda/(\Lambda + n_i)$ by a smaller
value.

Example 8.35 (Continuation of Example 8.14; see page 487). We can estimate
$\mathrm{Var}(\hat\Psi)$ by

$$\left(\frac{10}{38.37032 + 10\times 14.67878} + \frac{12}{38.37032 + 12\times 14.67878}
+ \frac{15}{38.37032 + 15\times 14.67878}\right)^{-1} = 5.95368.$$

The additional variance terms for the three groups are 0.25568, 0.19048, and
0.13112, respectively. The adjustments specified by (8.34) are 3.30693, 2.81623,
and 2.30494, respectively. Adding these together gives the adjusted variances to
be 3.56261, 3.00671, and 2.43606, respectively, all somewhat larger than the naive
variances.
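
The numbers in this example can be reproduced with a short Python sketch. It assumes that (8.34) has the form $(\hat\Sigma^2/n_i)\,[1 - \frac{k-1}{k}\hat\Lambda/(\hat\Lambda + n_i)]$, which matches the reported adjustments for $k = 3$; the function names are ours.

```python
def var_psi_hat(sigma2, tau2, sizes):
    """Estimate of Var(Psi-hat): inverse of sum_i n_i / (sigma2 + n_i * tau2)."""
    return 1.0 / sum(n / (sigma2 + n * tau2) for n in sizes)

def adjusted_variances(sigma2, tau2, sizes):
    """Adjusted empirical Bayes variances: extra term for Psi-hat plus the (8.34) term."""
    k = len(sizes)
    lam = sigma2 / tau2                       # Lambda-hat = Sigma-hat^2 / T-hat^2
    v_psi = var_psi_hat(sigma2, tau2, sizes)
    result = []
    for n in sizes:
        shrink = lam / (lam + n)              # naive shrinkage factor
        extra = shrink ** 2 * v_psi           # allowance for estimating Psi
        morris = sigma2 / n * (1.0 - (k - 1) / k * shrink)   # assumed form of (8.34)
        result.append(extra + morris)
    return result

v_psi = var_psi_hat(38.37032, 14.67878, [10, 12, 15])
adjusted = adjusted_variances(38.37032, 14.67878, [10, 12, 15])
```

Here `v_psi` matches 5.95368 and `adjusted` matches 3.56261, 3.00671, and 2.43606 to the reported precision.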

We might now ask how the adjusted empirical Bayes posteriors compare to 
the posteriors calculated from a hierarchical model with prior distributions for 
all parameters. Such a model exists in the original description on page 487. Plots 
of the posterior densities from these models were drawn in Figure 8.15 together 
with the adjusted empirical Bayes distributions. Two of the $M_i$ have empirical
Bayes distributions that are very close to the posteriors, but $M_1$ has a noticeably
smaller variance in the empirical Bayes analysis than in the Bayesian analysis.




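Table 8.36. Hierarchical Model for One-Way ANOVA with Unequal Variances

Stage            Density
Data             $\prod_{i=1}^k (2\pi\sigma_i^2)^{-n_i/2}\exp\left\{-\sum_{i=1}^k \frac{1}{2\sigma_i^2}\left[n_i(\bar x_i - \mu_i)^2 + (n_i - 1)s_i^2\right]\right\}$
Parameter        $(2\pi\tau^2)^{-k/2}\exp\left\{-\frac{1}{2\tau^2}\sum_{i=1}^k(\mu_i - \psi)^2\right\}$
Hyperparameter   $\sqrt{\zeta_0}\,(2\pi\tau^2)^{-1/2}\exp\left\{-\frac{\zeta_0}{2\tau^2}(\psi - \psi_0)^2\right\}$
Variance         $f_{\Sigma_1,\ldots,\Sigma_k,T}(\sigma_1,\ldots,\sigma_k,\tau)$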



8.4.3 Unequal Variance Case 

The case of a one-way ANOVA with unequal variances can also be handled
by empirical Bayes analysis. Suppose that we begin with the model in
Table 8.36, which is a generalization of the model of Section 8.2.1. The
posterior mean of $M_i$ for fixed values of the variance parameters and $\Psi$ is

$$\mu_i(\psi, \sigma_i, \tau) = \frac{\dfrac{n_i\bar x_i}{\sigma_i^2} + \dfrac{\psi}{\tau^2}}{\dfrac{n_i}{\sigma_i^2} + \dfrac{1}{\tau^2}}. \qquad (8.37)$$

The posterior mean of $\Psi$ for fixed values of the variance parameters is

-l 



(8.38) 




The resulting likelihood function for $T, \Sigma_1, \ldots, \Sigma_k$ is $\tau^{-k}$ times



The MLEs of the variance parameters must be either found numerically or 
approximated. Morris (1983) suggests using approximately unbiased esti- 
mates instead. For example, tf = (n< - l)s 2 Jn u and 

_ SUafcftefr-^-.-^-W-g} (8 . 39) 



EL, 



where (8.38) and (8.39) must be solved iteratively. One can choose a starting
$\hat T$ value and plug it into (8.38) (together with $\hat\Sigma_1, \ldots, \hat\Sigma_k$) to produce
a $\hat\Psi$ to plug into (8.39) to produce a new $\hat T$, and so on, until the estimates
converge.$^{8}$ Morris (1983) also suggests replacing $\hat M_i$ by
$(1 - \hat B_i)\bar x_i + \hat B_i\hat\Psi(\hat\Sigma_1, \ldots, \hat\Sigma_k, \hat T)$, where

$$\hat B_i = \frac{k-3}{k-2}\,\frac{\hat\Sigma_i^2}{\hat\Sigma_i^2 + n_i\hat T^2}$$

causes there to be less shrinkage toward a common mean.$^{9}$ The recommended
variance for $M_i$ is given as



rii 



1- 1- 



+ (x i ~$(E 1 ,...,E fc ,f)) 2 



:B 



Bi 



krii 



l (E* + n^ 2 )£L 



Kass and Steffey (1989) present an alternative treatment of this case from
a Bayesian viewpoint. They find a normal approximation to the posterior
distribution of the parameters $V = (\Sigma_1, \ldots, \Sigma_k, \Psi, T)$ (in a manner similar
to the method of Laplace) and then use the delta method to approximate
the mean and variance of (8.37), thought of as a function of $V$. The posterior
variance of $M_i$ is $E(\Sigma_i^2)/n_i$ plus the variance of (8.37).



8.5 Successive Substitution Sampling 

The model analyzed in Section 8.2.2 is an example of one that got out of 
hand very quickly, even though it started out in a fairly straightforward 
manner. Another method for finding posterior distributions can be used 
for such models without getting bogged down in such messy calculation. 
The method is a simulation version of the method of successive substitution 
used to solve fixed-point problems. 

8.5.1 The General Algorithm 

In general, if $g : A \to A$, and we are interested in finding an $x$ such that
$g(x) = x$, we could proceed as follows. Pick $x_0 \in A$. For $n = 1, 2, \ldots$, define
$x_n = g(x_{n-1})$. If $\{x_n\}_{n=1}^\infty$ converges and $g$ is continuous, then the limit $x$
is a fixed point of $g$; that is, $g(x) = x$.
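
A minimal Python sketch of deterministic successive substitution follows. The choice $g = \cos$ and the stopping rule (stop when two successive iterates agree to within a tolerance) are illustrative assumptions, not part of the text.

```python
import math

def successive_substitution(g, x0, tol=1e-12, max_iter=1000):
    """Iterate x_n = g(x_{n-1}) until successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("successive substitution did not converge")

# cos is continuous and the iteration converges from x0 = 1,
# so the limit satisfies cos(x) = x.
root = successive_substitution(math.cos, 1.0)
```

If the sequence fails to converge within `max_iter` steps, the sketch raises an error rather than returning a value that is not a fixed point.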



8 The algorithm described here is an example of successive substitution, which
will be described in Section 8.5.

9 In the case k = 3 there is no shrinkage, and when k = 2, I don't know what is
recommended, although it seems clear from the formula for the adjusted variance
that k > 3 is required for this analysis.






The type of fixed-point problem we will study is the following. Suppose
that $Y_1, \ldots, Y_k$ are random quantities and that we know the conditional
distribution of $Y_i$ given the others, for each $i$. Suppose that the conditional
distribution of $Y_i$ given the others has density $f_{Y_i|\{Y_j : j \ne i\}}$ with respect to
a measure $\lambda_i$. (It will prove convenient to use the notation $Y_{\setminus i}$ to stand for
$\{Y_j : j \ne i\}$, so that this last density can be written $f_{Y_i|Y_{\setminus i}}$.) We wish to find
the joint distribution of $(Y_1, \ldots, Y_k)$. Suppose that the joint distribution has
density $f_Y$ with respect to the product measure $\lambda = \lambda_1 \times \cdots \times \lambda_k$. Let
$X' = (X'_1, \ldots, X'_k)$ have a distribution with density $f_{X'}$ with respect to $\lambda$. Define
the distribution of a new random quantity $X = (X_1, \ldots, X_k)$ as follows.
Suppose that $X' = (x'_1, \ldots, x'_k)$ is observed. The density of $X_1$ with respect
to $\lambda_1$ is $f_{Y_1|Y_{\setminus 1}}(\cdot|x'_2, \ldots, x'_k)$. The conditional density of $X_2$ given $X_1 = x_1$
is $f_{Y_2|Y_{\setminus 2}}(\cdot|x_1, x'_3, \ldots, x'_k)$. Continue until we get the conditional density of
$X_k$ given $X_1 = x_1, \ldots, X_{k-1} = x_{k-1}$ to be $f_{Y_k|Y_{\setminus k}}(\cdot|x_1, \ldots, x_{k-1})$. In words,
when we derive the conditional distribution of $X_j$ given $X_1, \ldots, X_{j-1}$, we
use the observed values of $X'_{j+1}, \ldots, X'_k$ in the conditional distributions.

When we get to $j = k$, we are using only the $X_i$ values. If we define
$z^1 = (x'_2, \ldots, x'_k)$, $z^i = (x_1, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_k)$ for $i = 2, \ldots, k-1$, and
$z^k = (x_1, \ldots, x_{k-1})$, then the following equation is satisfied:

$$f_X(x) = \int \left[\prod_{i=1}^k f_{Y_i|Y_{\setminus i}}(x_i|z^i)\right] f_{X'}(x')\,d\lambda(x').$$

We can define the operator $T$ from the set of densities with respect to $\lambda$ to
itself by

$$T(f)(x) = \int \left[\prod_{i=1}^k f_{Y_i|Y_{\setminus i}}(x_i|z^i)\right] f(x')\,d\lambda(x').$$

It is easy to see that $T(f_Y) = f_Y$, so the joint density of $Y$ is a fixed point
of $T$.

The method of successive substitution applied to the fixed-point problem
just described would be to pick an initial density $f_0$, say, and then
let $f_n = T(f_{n-1})$ for $n = 1, 2, \ldots$. This would require the calculation of
a great many integrals that may not have closed-form expressions. An
alternative is to draw samples from the various conditional distributions
instead of calculating the integrals. In the notation just used, suppose
that $X' = (x'_1, \ldots, x'_k)$ is generated from the distribution with density
$f_{X'}$. Then suppose that $X = (X_1, \ldots, X_k)$ is generated as follows. Generate
$X_1$ from the distribution with density $f_{Y_1|Y_{\setminus 1}}(\cdot|x'_2, \ldots, x'_k)$. Let $x_1$
be the generated value. Generate $X_2$ from the distribution with density
$f_{Y_2|Y_{\setminus 2}}(\cdot|x_1, x'_3, \ldots, x'_k)$. Continue until we generate $X_k$ from the distribution
with density $f_{Y_k|Y_{\setminus k}}(\cdot|x_1, \ldots, x_{k-1})$. The joint density of $X$ is $T(f_{X'})$.
So, we can take a starting density $f_0$ and generate $X^0$ from the distribution
with this density. Then, using the method just described for $n = 1, 2, \ldots$,
generate $X^n$ from the distribution with density $T(f_{n-1})$. This method has
been called successive substitution sampling (abbreviated SSS) because it
is just a sampling version of successive substitution.$^{10}$
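
The sampling scheme can be sketched in Python for a case where both conditional distributions are known exactly. The target below, a standard bivariate normal with correlation $\rho$, is an illustrative assumption (it is not one of the models of this chapter), as are the function name and the degenerate starting point $(0, 0)$.

```python
import random

def sss_bivariate_normal(rho, iterations, seed=0):
    """Successive substitution sampling with k = 2 coordinates.

    Conditionals of the standard bivariate normal with correlation rho:
    Y1 | Y2 = y2 ~ N(rho*y2, 1 - rho^2) and Y2 | Y1 = y1 ~ N(rho*y1, 1 - rho^2).
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    y1, y2 = 0.0, 0.0                    # starting value X^0
    chain = []
    for _ in range(iterations):
        y1 = rng.gauss(rho * y2, sd)     # uses the previous value of y2
        y2 = rng.gauss(rho * y1, sd)     # uses the value of y1 just generated
        chain.append((y1, y2))
    return chain

chain = sss_bivariate_normal(rho=0.9, iterations=20000)
```

Each pass through the loop is one application of the operator $T$ in sampled form: every coordinate is regenerated from its conditional distribution given the most recent values of the others.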

One must, of course, stop the iteration at some point using the sample 
with density T(f n ) in lieu of a sample with density /y. There are several 
ways to prove that SSS converges as n goes to infinity. The following theo- 
rem is proven by Schervish and Carlin (1992). Its proof, which is given for 
completeness, relies heavily on operator theory in Hilbert space. 11 Readers 
unfamiliar with this theory can safely skip over the proof. The necessary 
theorems from operator theory are stated in Appendix C. 12 

Theorem 8.40. In the notation of this section, let

$$K(x', x) = \prod_{i=1}^k f_{Y_i|Y_{\setminus i}}(x_i|z^i).$$

Assume that

$$\int\!\!\int |K(x', x)|^2\,\frac{f_Y(x')}{f_Y(x)}\,d\lambda(x')\,d\lambda(x) < \infty \qquad (8.41)$$

and that $K > 0$ almost everywhere with respect to $\lambda \times \lambda$. Let $H$ be the set
of functions $f$ such that$^{13}$ $\|f\|^2 = \int |f(x)|^2/f_Y(x)\,d\lambda(x) < \infty$. There exists
a number $c \in [0, 1)$ such that for every density $f_0 \in H$, the sequence of
functions $f_n = T(f_{n-1}) = T^n(f_0)$ for $n = 1, 2, \ldots$ satisfies
$\|f_n - f_Y\| \le \|f_0\|c^n$ for all $n$.



10 Many authors call this method Gibbs sampling. This is actually a misnomer. 
Geman and Geman (1984) described this method as a way to generate a sample 
from a Gibbs distribution, and they called their particular implementation the 
Gibbs sampler. Gelfand and Smith (1990) generalized the method to arbitrary 
distributions but continued to call it Gibbs sampling, even though they were 
no longer sampling Gibbs distributions. The SSS algorithm is a special case of 
the broad class of Markov chain Monte Carlo methods. Note that the sequence 
$\{X^n\}_{n=0}^\infty$ is a Markov chain (see Definition B.125). A good survey of general
Markov Chain Monte Carlo methods is given by Tierney (1994). 

11 An alternative is to notice that the sequence $X^1, X^2, \ldots$ is a Markov chain
(see Definition B.125 on page 650). One then applies a theorem like the one given 
by Doob (1953, Section V.5). The conditions of such theorems are often difficult, 
if not impossible, to verify in specific applications. 

12 Some good treatments of operator theory can be found in Berberian (1961) 
and Dunford and Schwartz (1963). 

13 We use the symbol ||/|| for the norm of an element of a Hilbert space. 
The norm ||T|| of an operator T is the supremum of ||T(/)||/||/||. Dunford and 
Schwartz (1963) use the symbols |/| and \T\ for these norms. They use the sym- 
bol ||T|| for the Hilbert-Schmidt norm or double norm of a Hilbert-Schmidt-type 
operator. We only mention this here in case the reader decides to refer to Dunford 
and Schwartz (1963) for some of the proofs of auxiliary results. 







Proof. We will use Hilbert space notation and define the inner product

$$\langle g, h\rangle = \int g(x)h(x)\,d\mu(x),$$

where $\mu(A) = \int_A [1/f_Y(x)]\,d\lambda(x)$. It follows that $H$ is the Hilbert space
$L^2(\mu)$. The norm in this space is $\|g\| = \sqrt{\langle g, g\rangle}$. If we let
$K_0(x', x) = K(x', x)f_Y(x')$, then $T(f)(x) = \int K_0(x', x)f(x')\,d\mu(x')$ is the operator
that takes a density for observations at one iteration of SSS to the density of
observations at the next iteration, and (8.41) becomes

$$\int\!\!\int |K_0(x', x)|^2\,d\mu(x')\,d\mu(x) < \infty.$$

In fact, it is clear that $K_0(x', x)$ is a joint density of two successive
iterations of SSS, $x'$ and $x$, if the first iteration has the solution density
$f_Y$. Furthermore, by writing each of the conditional density factors in
$K_0$ as the ratio of the joint density $f_Y$ to a joint density for all but one
of the observations, and then rearranging the factors, one can show that
$T^*(f)(x) = \int K_0(x, x')f(x')\,d\mu(x')$ is the operator that takes a density for
observations at one iteration of SSS to the density of observations at the next
iteration if the order of updating coordinates is reversed. For this reason,
it is easy to see that for each $g$ and $h$ in $H$ that are integrable with respect
to $\lambda$,

$$\langle T(g), h\rangle = \langle g, T^*(h)\rangle.$$

The last equation is the definition of what it means to say that $T^*$ is the
adjoint of the operator $T$. It also follows from this equation that the adjoint
of the composition $U = T(T^*)$ is itself, $U$. That is to say, $U$ is self-adjoint.
Since $U$ is two applications of successive substitution, it follows that

$$\int U(f)(x)\,d\lambda(x) = \int f(x)\,d\lambda(x). \qquad (8.42)$$

According to Theorem C.10 the operator $T$ is of Hilbert-Schmidt type
because (8.41) holds. Theorem C.11 says that such an operator is completely
continuous.$^{14}$ It follows then that the adjoint operator $T^*$ is also
completely continuous as is $U$. Since $U$ is self-adjoint and completely
continuous, $H$ has an orthonormal basis of eigenfunctions of $U$. Also,
Theorem C.12 says that a self-adjoint completely continuous operator has an
eigenvalue whose absolute value is equal to the norm of the operator.

14 An operator $T$ is completely continuous if every bounded set $B \subseteq H$ is
mapped by $T$ to a set whose closure is sequentially compact. (That is, every
sequence in $T(B)$ has a convergent subsequence.)

Let $V$ be the operator defined by $V(f) = U(f) - f_Y\langle f_Y, f\rangle$. In particular,
$V(f_Y) = 0$ because $T(f_Y) = T^*(f_Y) = f_Y$ and $\langle f_Y, f_Y\rangle = 1$. It is easy to
see that $V = W^*W$, where $W(f) = T(f) - f_Y\langle f_Y, f\rangle$, and $W^*$ is the
adjoint of $W$. It follows from Theorem C.13 that $\|V\| = \|W^*W\| = \|W\|^2$.
The remainder of the proof will be to show that $\|V\|$ and hence $\|W\|$ are
strictly less than 1, and then to show that this implies the conclusion to
the theorem.

Since $V$ is self-adjoint and completely continuous, we can show that
$\|V\| < 1$ by showing that the absolute value of its largest eigenvalue is
strictly less than 1. Let $r$ be the largest eigenvalue of $V$, which is real since
$V$ is self-adjoint. Let $V(g) = rg$. If $r = 0$, the result holds, so suppose that
$r \ne 0$. Then

$$\langle g, f_Y\rangle = \frac{1}{r}\langle V(g), f_Y\rangle = \frac{1}{r}\langle g, V(f_Y)\rangle = 0,$$

since $V(f_Y) = 0$. Since $g$ is not identically 0, we can write $g = g^+ - g^-$
where $g^+$ and $g^-$ are respectively the positive and negative parts of $g$. Let
$B$ be the set of $x$ such that $g(x) > 0$, and let $C$ be the set of $x$ such
that $g(x) < 0$. Then $\lambda(B) > 0$ and $\lambda(C) > 0$ since $\langle f_Y, g\rangle = 0$ but $g$ is
not identically 0. We will show that $|r| < 1$ by means of contradiction. If
$|r| = 1$, then

$$V(g) = U(g^+ - g^-) = U(g^+) - U(g^-) =
\begin{cases} g^+ - g^- & \text{if } r = 1, \\ g^- - g^+ & \text{if } r = -1. \end{cases}$$

Since $K_0 > 0$, it follows that $U(g^+)(x) > 0$ and $U(g^-)(x) > 0$ for all $x$.
Hence,

$$g^+(x) < \begin{cases} U(g^+)(x) & \text{if } r = 1, \\ U(g^-)(x) & \text{if } r = -1, \end{cases} \quad \text{for } x \in B,$$

$$g^-(x) < \begin{cases} U(g^-)(x) & \text{if } r = 1, \\ U(g^+)(x) & \text{if } r = -1, \end{cases} \quad \text{for } x \in C.$$

It follows that $U(g^+ + g^-) > g^+ + g^-$ for all $x$. In other words, $U(|g|) > |g|$,
which would imply $\int U(|g|)\,d\lambda > \int |g|\,d\lambda$, which contradicts (8.42). Hence,
$|r| < 1$ and the largest eigenvalue of $V$ has absolute value $|r| < 1$. It follows
that $\|V\| = |r| < 1$.

Now, we know that $\|W\| = |r|^{1/2} = c < 1$. If $f$ is a density, then
$\langle f_Y, f\rangle = 1$ and

$$W(f) = T(f) - f_Y = T(f - f_Y).$$

Similarly, if $\langle f_Y, g\rangle = 0$, then $W(g) = T(g)$ and $\langle f_Y, W(g)\rangle = 0$, from
which it follows that $W^n(g) = T^n(g)$ for all $n$. Since $\langle f_Y, f - f_Y\rangle = 0$ for
every density $f$, it follows that, for all $n$,

$$W^n(f) = W^n(f - f_Y) = T^n(f - f_Y) = T^n(f) - f_Y.$$

So, for all $n$,

$$\|T^n(f_0) - f_Y\| = \|W^n(f_0)\| \le \|W\|^n\|f_0\| = c^n\|f_0\|. \qquad \Box$$



Although it appears that one needs to know the solution /y in order to 
check the conditions of this theorem, one often knows the function /y up 
to a multiplicative constant. Hence, one could, at least in principle, check 
the finiteness of the various integrals in Theorem 8.40. 

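Example 8.43. Suppose that the posterior density of $(Y_1, Y_2, Y_3)$ is proportional
to

$$y_3^{-4}\exp\left\{-\frac{1}{2y_3}\left[\frac{(y_2 - 0.9y_1)^2}{0.19} + y_1^2 + 4\right]\right\}.$$

It is not difficult to see that the three conditional distributions are

$$Y_1 \mid Y_2 = y_2, Y_3 = y_3 \ \sim\ N(0.9y_2,\ 0.19y_3),$$
$$Y_2 \mid Y_1 = y_1, Y_3 = y_3 \ \sim\ N(0.9y_1,\ 0.19y_3),$$
$$Y_3 \mid Y_1 = y_1, Y_2 = y_2 \ \sim\ \Gamma^{-1}\left(3,\ \frac{1}{2}\left[4 + y_1^2 + \frac{(y_2 - 0.9y_1)^2}{0.19}\right]\right).$$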

The integrand in (8.41) is a constant times x'^x^ 5 times e to the power 

1 f (xi-0.9x 2 ) 2 , (s 2 -0.9si) 2 j a (a^ -O.Qsl) 2 \ 
"2^\ 4+ 019 + 019 +rEl + 0.19 J 

2x 3 \ 4 + a;i+ 0.19 J 

By collecting terms here, it is not difficult to show that this function is integrable 
over the six variables x\ , x 2 , #3, x'i , x 2 , x 3 . 

After one stops the iteration, one has a vector $Y$ from approximately the
correct distribution. One can repeat the process and produce $Y^1, \ldots, Y^m$
for some value $m$. If one wants the marginal density of $Y_i$, one can let
$Y^s_{\setminus i}$ stand for the $(k-1)$-dimensional vector formed from $Y^s$ by removing
$Y^s_i$, and then calculate

$$\hat f_{Y_i}(y) = \frac{1}{m}\sum_{s=1}^m f_{Y_i|Y_{\setminus i}}(y \mid Y^s_{\setminus i}). \qquad (8.44)$$

This estimator is based on the simple fact that, for each $s$,

$$f_{Y_i}(y) = E\left(f_{Y_i|Y_{\setminus i}}(y \mid Y^s_{\setminus i})\right).$$

If one wants the mean of $Y_i$, one can calculate

$$\hat\mu_i = \frac{1}{m}\sum_{s=1}^m E(Y_i \mid Y_{\setminus i} = Y^s_{\setminus i}), \qquad (8.45)$$

assuming that the conditional mean of $Y_i$ given the others is easily available.
Equation (8.45) should be better than $\sum_{s=1}^m Y^s_i/m$, since the variance of
the simple average is the variance of (8.45) plus $1/m$ times the mean of
the conditional variance of $Y_i$ given $Y_{\setminus i}$. Similarly, the variance of $Y_i$ can
be approximated by

$$\frac{1}{m}\sum_{s=1}^m \mathrm{Var}(Y_i \mid Y_{\setminus i} = Y^s_{\setminus i})
+ \frac{1}{m}\sum_{s=1}^m \left(E(Y_i \mid Y_{\setminus i} = Y^s_{\setminus i}) - \hat\mu_i\right)^2, \qquad (8.46)$$

which should be a better estimate than the sample variance of the $Y^s_i$.
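
The conditional-expectation estimators can be sketched in Python for a target whose conditional moments are known. The sketch assumes a standard bivariate normal target with correlation $\rho$ (so $E(Y_1 \mid Y_2 = y_2) = \rho y_2$ and $\mathrm{Var}(Y_1 \mid Y_2) = 1 - \rho^2$), keeps every `thin`-th iterate as one of the $m$ sampled vectors, and uses the divisor $m$ in the second sum; all of these choices, and the function name, are illustrative assumptions.

```python
import random

def conditional_moment_estimates(rho=0.9, m=5000, thin=20, seed=2):
    """Estimate the mean and variance of Y1 from conditional moments.

    The mean estimate averages E(Y1 | Y2 = y2^s); the variance estimate is the
    average conditional variance plus the spread of the conditional means.
    """
    rng = random.Random(seed)
    cond_var = 1.0 - rho * rho
    sd = cond_var ** 0.5
    y1 = y2 = 0.0
    cond_means = []
    for step in range(m * thin):
        y1 = rng.gauss(rho * y2, sd)
        y2 = rng.gauss(rho * y1, sd)
        if (step + 1) % thin == 0:          # keep every thin-th iterate
            cond_means.append(rho * y2)     # E(Y1 | Y2 = y2) for this sampled vector
    mean_hat = sum(cond_means) / m
    # The conditional variance is the same constant for every sampled vector here.
    var_hat = cond_var + sum((c - mean_hat) ** 2 for c in cond_means) / m
    return mean_hat, var_hat

mean_hat, var_hat = conditional_moment_estimates()
```

For this target the true mean and variance of $Y_1$ are 0 and 1, so the two estimates can be checked directly.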

The SSS algorithm, as described, assumes that the random quantities 
Yi,...,Yfc are in a fixed order for every iteration. This is not actually 
required for convergence of the algorithm. The proof of convergence is sim- 
plified by making this assumption however. Note also that each Yi need not 
be a single random variable. Some of them might themselves be vectors. 
The question of how to arrange the coordinates is important for the rate 
of convergence. The more dependence that exists between successive itera- 
tions, the slower the convergence will be. One can understand why this is 
true intuitively by realizing that convergence "occurs" when an iteration is 
"independent" of the starting iteration. The more dependence lingers from 
one iteration to the next, the longer it takes to get an iteration that is 
essentially independent of the start. Example 8.43 can be used to illustrate 
how the choice of coordinate arrangement affects the dependence between 
iterates. 

Example 8.47 (Continuation of Example 8.43; see page 510). It is not difficult
to see that every order of the three coordinates is essentially equivalent in this
example. Instead, let us compare the natural order $Y_1, Y_2, Y_3$ to the alternative
arrangement $X_1 = (Y_1, Y_2)^\top$, $X_2 = Y_3$. That is, let the first random quantity be
a two-dimensional vector consisting of both $Y_1$ and $Y_2$. To illustrate the effect of
this change on the amount of dependence between iterations, we will calculate
the conditional distribution of $Y_1$ at the next iteration given the variables at the
current iteration for both arrangements.

In the natural order, we generate $Y_1$ with $N(0.9y'_2, 0.19y'_3)$ distribution given
$Y'_1 = y'_1$, $Y'_2 = y'_2$, and $Y'_3 = y'_3$. In the vector arrangement, we generate the whole
vector $(Y_1, Y_2)$ at once with distribution $N_2(0, y'_3A)$ where

$$A = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$

The conditional distribution of $Y_1$ given $Y'_1 = y'_1$, $Y'_2 = y'_2$, and $Y'_3 = y'_3$ is
$N(0, y'_3)$ in this case. The dependence on the previous iteration is greatly reduced
in the vector arrangement. In the natural order, $Y_1$ is much more constrained by the
values $y'_2$ and $y'_3$ than in the vector arrangement. A similar calculation shows
that, in the natural order, the conditional distribution of $Y_2$ at the next iteration
given $Y'_1 = y'_1$, $Y'_2 = y'_2$, and $Y'_3 = y'_3$ is $N(0.81y'_2, 0.3439y'_3)$, while in the vector
arrangement it is $N(0, y'_3)$. Although $Y_2$ is less dependent on the previous iteration
than is $Y_1$, it is still more dependent in the natural order than in the vector
arrangement.
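
The constants 0.81 and 0.3439 follow from composing the two normal conditionals: if $Y_1 \sim N(0.9y'_2, 0.19y'_3)$ and $Y_2 \mid Y_1 \sim N(0.9Y_1, 0.19y'_3)$, then $Y_2$ has mean $0.9^2 y'_2$ and variance $(0.9^2 \times 0.19 + 0.19)y'_3$. A small Python sketch of this arithmetic (the function name is ours):

```python
def natural_order_y2_given_previous(coef=0.9, cond_var=0.19):
    """Return (a, v) such that Y2 at the next iteration is N(a*y2', v*y3').

    Y1 | previous ~ N(coef*y2', cond_var*y3') and
    Y2 | Y1       ~ N(coef*Y1, cond_var*y3'), with y3' still the previous value.
    """
    mean_coef = coef * coef                        # E(Y2) = coef * E(Y1)
    var_coef = coef ** 2 * cond_var + cond_var     # law of total variance
    return mean_coef, var_coef

mean_coef, var_coef = natural_order_y2_given_previous()
```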

As a rule of thumb, if one knows that several random variables are highly 
dependent, it will be better, if possible, to treat them as a single random 
quantity in the SSS algorithm rather than to treat each one as a separate 
coordinate. 



8.5.2 Normal Hierarchical Models 

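Take the model in Section 8.2.1 as an example. The vector $Y$ in the
discussion of this section will be the collection of all parameters of the model,
namely $M_1, \ldots, M_k, \Psi, \Sigma^2, T^2$. We will use the prior in which $\Sigma^2$ and $T^2$
are independent with inverse gamma distributions $\Gamma^{-1}(a_0/2, b_0/2)$ and
$\Gamma^{-1}(c_0/2, d_0/2)$, respectively. All distributions will be conditional on the
data. It is easy to calculate the various conditional distributions we need.
In the following list, each distribution is to be understood as conditional
on both the data and on all of the other parameters.

$$M_i \sim N\left(\frac{n_i\bar x_i\tau^2 + \psi\sigma^2}{n_i\tau^2 + \sigma^2},\ \frac{\sigma^2\tau^2}{n_i\tau^2 + \sigma^2}\right),$$

$$\Psi \sim N\left(\frac{\zeta_0\psi_0 + \sum_{i=1}^k \mu_i}{k + \zeta_0},\ \frac{\tau^2}{k + \zeta_0}\right),$$

$$T^2 \sim \Gamma^{-1}\left(\frac{c_0 + k + 1}{2},\ \frac{d_0 + \sum_{i=1}^k(\mu_i - \psi)^2 + \zeta_0(\psi - \psi_0)^2}{2}\right),$$

$$\Sigma^2 \sim \Gamma^{-1}\left(\frac{a_0 + n_1 + \cdots + n_k}{2},\ \frac{b_0 + \sum_{i=1}^k\left[(n_i - 1)s_i^2 + n_i(\bar x_i - \mu_i)^2\right]}{2}\right).$$

It is easy to generate pseudorandom numbers from each of the above
distributions, so that SSS could be implemented without much trouble.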
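
A minimal Python sketch of SSS for this model follows. It assumes the conditional distributions are normal for each $M_i$ and for $\Psi$, and inverse gamma for $T^2$ and $\Sigma^2$, with $\Gamma^{-1}(a, b)$ meaning that the reciprocal has a gamma distribution with shape $a$ and rate $b$. The summary statistics, the value of $b_0$, the burn-in length, and the function names are hypothetical; they are not the data of Example 8.14.

```python
import random

def inv_gamma(rng, shape, rate):
    """Draw X with 1/X ~ Gamma(shape, rate); gammavariate takes a scale parameter."""
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def sss_one_way_anova(xbar, s2, n, a0, b0, c0, d0, psi0, zeta0,
                      iterations=5000, burn_in=500, seed=3):
    """Cycle through the conditionals of M_1..M_k, Psi, T^2, Sigma^2.

    Returns the average of the sampled M_i after burn-in.
    """
    rng = random.Random(seed)
    k = len(xbar)
    mu = list(xbar)                       # start at the group sample means
    psi = sum(xbar) / k
    sigma2 = sum(s2) / k
    tau2 = 1.0
    sums = [0.0] * k
    for it in range(iterations):
        for i in range(k):                # M_i given everything else
            prec = n[i] / sigma2 + 1.0 / tau2
            mean = (n[i] * xbar[i] / sigma2 + psi / tau2) / prec
            mu[i] = rng.gauss(mean, (1.0 / prec) ** 0.5)
        # Psi given everything else
        psi = rng.gauss((zeta0 * psi0 + sum(mu)) / (k + zeta0),
                        (tau2 / (k + zeta0)) ** 0.5)
        # T^2 given everything else
        rate_t = (d0 + sum((m - psi) ** 2 for m in mu)
                  + zeta0 * (psi - psi0) ** 2) / 2.0
        tau2 = inv_gamma(rng, (c0 + k + 1) / 2.0, rate_t)
        # Sigma^2 given everything else
        ss = sum((n[i] - 1) * s2[i] + n[i] * (xbar[i] - mu[i]) ** 2
                 for i in range(k))
        sigma2 = inv_gamma(rng, (a0 + sum(n)) / 2.0, (b0 + ss) / 2.0)
        if it >= burn_in:
            for i in range(k):
                sums[i] += mu[i]
    return [s / (iterations - burn_in) for s in sums]

# Hypothetical summary statistics for k = 3 groups.
post_means = sss_one_way_anova(xbar=[27.0, 18.5, 19.5], s2=[40.0, 35.0, 38.0],
                               n=[10, 12, 15], a0=1.0, b0=1.0, c0=1.0, d0=1.0,
                               psi0=10.0, zeta0=0.1)
```

The sketch averages the sampled $M_i$ directly for brevity; the conditional means computed inside the loop could be averaged instead, in the spirit of (8.45).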

Example 8.48 (Continuation of Example 8.14; see page 487). We used naive
empirical Bayes estimates of the parameters as starting values and then ran
twenty thousand iterations, taking every twentieth iteration as a sampled value.
We also ran forty thousand iterations where we took every fortieth iteration as a
sampled value. The differences were negligible. Then we calculated (8.44) for each
of the three population means. These densities are plotted in Figure 8.15. The
respective posterior means were calculated to be 26.30, 18.78, and 19.82 using
(8.45). The posterior variances were calculated using (8.46) to be 4.525, 2.998,
and 2.387, respectively.

The SSS algorithm is also well suited to handle the case in which each
population has its own variance, $\Sigma_i^2$. Suppose that the $\Sigma_i^2$ are independent
with $\Gamma^{-1}(a_0/2, b_0/2)$ distribution in the prior. In this case, we replace two
of the above distributions with

$$M_i \sim N\left(\frac{n_i\bar x_i\tau^2 + \psi\sigma_i^2}{n_i\tau^2 + \sigma_i^2},\ \frac{\sigma_i^2\tau^2}{n_i\tau^2 + \sigma_i^2}\right),$$

$$\Sigma_i^2 \sim \Gamma^{-1}\left(\frac{a_0 + n_i}{2},\ \frac{b_0 + (n_i - 1)s_i^2 + n_i(\bar x_i - \mu_i)^2}{2}\right).$$

This model has $k - 1$ more parameters than the equal variance model.

An intermediate case between equal variance and independent variances
is to have a hierarchical model for the variances. Suppose that the $\Sigma_i^2$
are conditionally independent with $\Gamma^{-1}(a_0/2, a_0\sigma^2/2)$ distribution given
$\Sigma^2 = \sigma^2$. Then suppose that the prior for $\Sigma^2$ is $\Gamma(f_0/2, g_0/2)$. This model
has $k$ more parameters than the equal variance model, and the conditional
distributions required for SSS must change to include the distributions
just given for the other unequal variance model together with

$$\Sigma^2 \sim \Gamma\left(\frac{f_0 + ka_0}{2},\ \frac{g_0 + a_0\sum_{i=1}^k \sigma_i^{-2}}{2}\right),$$

$$\Sigma_i^2 \sim \Gamma^{-1}\left(\frac{a_0 + n_i}{2},\ \frac{a_0\sigma^2 + (n_i - 1)s_i^2 + n_i(\bar x_i - \mu_i)^2}{2}\right).$$

Example 8.49. Suppose that we use the same data as in Example 8.14 on
page 487, but we use the hierarchical model for the population variances. We
continue to use $a_0 = 1$, $c_0 = 1$, $d_0 = 1$, $\psi_0 = 10$, and $\zeta_0 = 0.1$, but we include
$f_0 = 1$ and $g_0 = 0.1$. The resulting posterior densities are plotted in Figure 8.50.
The posterior means and variances from (8.45) and (8.46) are

i           1         2         3
mean        27.1392   18.7672   19.7399
variance    3.1634    4.6745    2.2667

We could also do a naive empirical Bayes analysis. (Recall that the adjusted
empirical Bayes analysis requires $k > 3$.) We will use $\hat\Sigma_i^2 = s_i^2$. We will also adjust
the variance of $M_i$ by adding on $\hat T^2$ times the square of the coefficient of $\psi$ in
(8.37). This results in $M_i$ having variance



3? 



(E^ + niT 2 ) 2 , 



We need to iterate between (8.38) and (8.39). Starting with $\hat T = 1$, it took four
iterations to get no difference between the iterations. The results were
$\hat\Psi = 21.9809$ and $\hat T = 23.4547$. This leads to the following naive empirical
Bayes posteriors: $N(27.3786, 2.5817)$, $N(18.8116, 5.4845)$, and $N(19.7526, 2.3257)$.
These three densities are also plotted in Figure 8.50. Notice that the hierarchical
model for the variances brings the estimated variance for the second population
(the largest of the three) down quite a bit from the empirical Bayes value while it
brings the other two variances up. Although the hierarchical model variance for $M_3$
is slightly larger than the empirical Bayes variance, the density (in Figure 8.50) is
more peaked. The additional variance comes from heavier tails.

[Figure 8.50. Numerical Approximations to Posterior Densities]

The more complicated two-way ANOVA described in Section 8.2.2 can
be handled using SSS in a much simpler fashion than the analytical
discussion in Section 8.2.2. Let the parameters be $M$, $A_i$, $B_j$, $(AB)_{i,j}$ (for
$i = 1, \ldots, a$ and $j = 1, \ldots, b$), $\Sigma_e^2$, $\Sigma_A^2$, $\Sigma_B^2$, and $\Sigma_{AB}^2$. We also have the
constraints (8.17). Because of the constraints, we cannot apply SSS in the
most naive manner. For example, the random variables $B_1, \ldots, B_b$ have the
property that the conditional distribution of each one given the others is
concentrated on a single value (namely, minus the sum of the others) with
probability one. This is an extreme case of dependence. No matter what
starting values one generates for $B_1, \ldots, B_b$, one will never change them,
no matter how many iterations one performs! Clearly, convergence cannot
occur in this case. 15 There are two ways to circumvent this problem. One
is to drop one of the parameters from the algorithm and just calculate it
when needed. That is, just calculate $B_b = -\sum_{j=1}^{b-1} B_j$ when it appears in



15 In fact, the condition that the function K be strictly positive in Theorem 8.40 
is violated. 



8.5. Successive Substitution Sampling 515 



some conditional distribution, but treat $B_1, \ldots, B_{b-1}$ as the parameters.
Another approach is to treat the entire vector $B = (B_1, \ldots, B_b)$ as one
parameter (one of the $Y_i$) as in the vector arrangement in Example 8.47
on page 511. Also, $(AB)_i = ((AB)_{i,1}, \ldots, (AB)_{i,b})$ could be treated as one
parameter for each $i$. We choose the vector approach here because the
constraints introduce dependence among the coordinates, which will slow down
convergence.

Suppose that our model says that M, Ai, . . . , A a , B, (AB)i, . . . , (AB) a 
are all conditionally independent given Eg,E^,E^, and Y? AB with 

M ~ W(0 O ,^), B ~ N b {0,al [/-±11 T ]), 
.A* ~ ATM), (AB), ~ JV 6 (0,^ B [J-$11 T ]). 

This produces the same model as in Section 8.2.2. Next, assume that the 
variance parameters are independent with inverse gamma distributions 

Note that the prior distributions of the constrained parameters are the
conditional distributions of independent random variables given that the
constraint holds. That is, for example, if $B_1, \ldots, B_b$ were IID $N(0, \sigma_B^2)$ and
we found the conditional distribution of $B$ given that $\sum_{j=1}^{b} B_j = 0$, that
conditional distribution would be the distribution given above for B. (See 
Problem 13 on page 535.) Also, if r = 6, then this model is the same as 
saying that the B^ are IID AT(0, <r%) and M = 

The posterior distributions of the parameters conditional on the other 
parameters are now easily calculated after we introduce some notation. 
Suppose that there are $n_{i,j}$ observations with the A factor at level i and
the B factor at level j. We do not need to assume that the cells all have 
the same sample size as we did in Section 8.2.2. In fact, there can even be 
empty cells in this analysis. Define 



n.j 








n. j Si=l ]Cfc=l 2/t,j,fc> 


n it . 








^~ Sj=i 2fc=i 


n 

• >• 






5.,.,. = 


n. Z^j=l Z-a=l Z_>fc=l yi,j,k 


5. 






= 










(«/»).,. = 




& 










ft 










w 




Z^i=i l^j=i Zwfc=i 




.) 2 -' 






By our seeking the conditional posterior distribution of one parameter given
the others, the data can be viewed as having a simple structure. For
example, if we want the conditional posterior distribution of $M$ given the other
parameters, we construct $Y^*_{i,j,k} = Y_{i,j,k} - A_i - B_j - (AB)_{i,j}$. Then the $Y^*_{i,j,k}$
are IID $N(\mu, \sigma_e^2)$ given $M = \mu$. The conditional posterior distribution of $M$
can easily be shown to be



n -,.(y. . 



M~ N 



lit 



n ,. i r 



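A similar analysis works for each $A_i$. The result is

$$A_i \sim N\left( \frac{n_{i,\cdot}\left(\bar{y}_{i,\cdot,\cdot} - \mu - \bar{\beta}_i - \overline{(\alpha\beta)}_{i,\cdot}\right)/\sigma_e^2}{n_{i,\cdot}/\sigma_e^2 + 1/\sigma_A^2},\ \frac{1}{n_{i,\cdot}/\sigma_e^2 + 1/\sigma_A^2} \right).$$
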

Similarly, if we want the conditional posterior of the vector $(AB)_i$ given
the other parameters, we construct $Y^*_{i,j,k} = Y_{i,j,k} - M - A_i - B_j$. Then
the $Y^*_{i,j,k}$ are independent with $Y^*_{i,j,k}$ having $N((\alpha\beta)_{i,j}, \sigma_e^2)$ distribution
given $(AB)_i = ((\alpha\beta)_{i,1}, \ldots, (\alpha\beta)_{i,b})^T$. The posterior can be found by first
calculating the posterior as if the parameters were unconstrained and then
conditioning on the constraint as in Problem 13 on page 535. Define the
$b$-dimensional vectors



z = 



e u AB n 



Let diag($v$) stand for the diagonal matrix with diagonal entries equal to the
coordinates of $v$. Then the conditional posterior of $(AB)_i$ is $N_b(C_i z_i, C_i)$,
where $C_i = \mathrm{diag}(v_i) - v_i v_i^T/(1^T v_i)$. A similar analysis works for the $B$
vector. In this case, define the vectors



( 



\ 



The posterior of $B$ is $N_b(Cz, C)$, where $C = \mathrm{diag}(v) - vv^T/(1^T v)$.

Since the variance parameters (except for $\Sigma_e$) are conditionally
independent of the data given the other parameters, their conditional posteriors
will not depend on the data. We can calculate the conditional posteriors






making use of Proposition 8.13: 



^ r 2 J' 

s B ~ r y—f-, j, 

^2 i (a-AB + ab-a b AB + £i=i Ej=i( AB )L A 

~ r ^ , y 

and Eg has distribution 

1 / Qe-fn,,, *>e + W + E^l Ej=l n ij(ViJ,. -H-CXi-Pj- (<X0)ij) 2 



r 



The only remaining problem for implementing SSS in this example is
how to simulate from a multivariate normal distribution $N_b(Cz, C)$ with a
singular covariance matrix $C = \mathrm{diag}(v) - vv^T/(1^T v)$. The most
straightforward way is to find a $b \times (b-1)$ matrix $D$ such that $DD^T = C$, then
generate an $N_{b-1}(0, I)$ vector $V$, and use $D(V + D^T z)$. Let be the first
6—1 coordinates of v, let be the vector with jth coordinate y/v], let 
diag(y^) be the diagonal matrix with (j, j) element equal to y/v], and let 



Then the following matrix satisfies DD T = C: 



/ diag {y/i£) - hv+y/vl J 
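The simulation step can also be sketched without building the matrix $D$. The following is a minimal illustration, not the construction given in the text: it draws independent normals and then projects them onto the constraint, which yields the same $N_b(Cz, C)$ distribution because that distribution is the conditional law of independent $N(v_j z_j, v_j)$ coordinates given that they sum to zero. The function name and the numerical values of `v` and `z` are hypothetical.

```python
import random

def sample_constrained_normal(v, z, rng=random):
    """Draw one vector from N_b(C z, C) with C = diag(v) - v v^T / (1^T v).

    Assumption: this uses projection of an unconstrained draw instead of
    the factor D with D D^T = C described in the text.
    """
    # Unconstrained draw: independent W_j ~ N(v_j * z_j, v_j).
    w = [rng.gauss(vj * zj, vj ** 0.5) for vj, zj in zip(v, z)]
    # Subtract v * (1^T W) / (1^T v); the result has mean C z, covariance C,
    # and coordinates that sum to zero.
    shift = sum(w) / sum(v)
    return [wj - vj * shift for wj, vj in zip(w, v)]

random.seed(0)
v = [0.5, 1.0, 2.0, 0.25]   # hypothetical conditional variances
z = [1.0, -2.0, 0.5, 3.0]   # hypothetical z vector
draw = sample_constrained_normal(v, z)
```

Every draw satisfies the sum-to-zero constraint up to rounding, so the degenerate direction of $C$ never has to be handled explicitly.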



One should also note that, just as in the one-way ANOVA, we could
have had unequal variances in the cells. That is, we could have had the
conditional variance of $Y_{i,j,k}$ be $\Sigma_{i,j}^2$ instead of $\Sigma_e^2$ for all $i$ and $j$. This would
have introduced $ab - 1$ additional variance parameters, but the conditional
distributions would have been only slightly more complicated. The serious
reader should work this case out in detail. In addition, a hierarchical model
for the $\Sigma_{i,j}^2$ could be introduced. Intermediately, one could model the $Y_{i,j,k}$
as having variance $\Sigma_i^2$ or $\Sigma_j^2$ so that some cells have the same variance and
others do not. All such models can be handled in nearly the same fashion
as above.



8.5.3 Nonnormal Models 

There is a large class of problems to which the SSS methodology can apply. 
We will not attempt to catalogue this class. We give only a few more exam- 
ples to show how, with a little imagination, the methodology can apply even 




where one would not normally think. In Example 7.104 on page 444, the
observables $X_i$ were modeled as having $\mathrm{Cau}(\theta, 1)$ distribution given $\Theta = \theta$.
Suppose now that we introduce an extra parameter $Y_i$ for each observation
and say that $X_i$ given $Y_i = y$ and $\Theta = \theta$ has $N(\theta, y)$ distribution and the $Y_i$
are independent of $\Theta$ and of each other with $\Gamma^{-1}(1/2, 1/2)$ distribution. 16
It follows that $X_i \sim \mathrm{Cau}(\theta, 1)$ given $\Theta = \theta$; hence this model is
equivalent to the original model. However, this new model is easily handled via
SSS. In Example 7.104, we supposed that $\Theta$ had $N(0, 1000)$ as a prior. The
conditional posterior of $\Theta$ given the $Y_i$ is



The conditional posterior of $Y_i$ given $\Theta = \theta$ is $\Gamma^{-1}(1, [1 + (x_i - \theta)^2]/2)$.
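The two conditional distributions can be alternated as in the following minimal sketch. The update for $\Theta$ given the $Y_i$ is not taken from the text (its displayed formula is missing here); it is assumed to be the usual conjugate normal update, with precision $1/1000 + \sum_i 1/y_i$ and mean $\sum_i (x_i/y_i)$ divided by that precision. The data values are hypothetical, not those of Example 7.104.

```python
import random

def gibbs_cauchy_location(x, prior_var=1000.0, n_iter=2000, burn_in=40, seed=1):
    """SSS for Cauchy data written as a scale mixture of normals."""
    rng = random.Random(seed)
    theta = sorted(x)[len(x) // 2]          # start at a central data value
    draws = []
    for it in range(n_iter + burn_in):
        # Y_i | theta ~ InvGamma(1, [1 + (x_i - theta)^2] / 2),
        # drawn as scale / Gamma(shape, 1).
        y = [((1.0 + (xi - theta) ** 2) / 2.0) / rng.gammavariate(1.0, 1.0)
             for xi in x]
        # Theta | y: assumed conjugate normal update with prior N(0, prior_var).
        prec = 1.0 / prior_var + sum(1.0 / yi for yi in y)
        mean = sum(xi / yi for xi, yi in zip(x, y)) / prec
        theta = rng.gauss(mean, prec ** -0.5)
        if it >= burn_in:
            draws.append(theta)
    return draws

# Hypothetical observations (one gross outlier).
data = [2.1, 3.4, 4.0, 4.4, 4.7, 5.1, 5.6, 6.2, 7.9, 15.3]
theta_draws = gibbs_cauchy_location(data)
theta_mean = sum(theta_draws) / len(theta_draws)
```

Averaging the retained draws estimates the posterior mean of $\Theta$; the sample variance of the draws estimates its posterior variance.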

Example 8.51 (Continuation of Example 7.104; see page 444). After 40
iterations, we constructed 10,000 vectors of the 11 parameters $Y_1, \ldots, Y_{10}, \Theta$. The
estimated mean and variance of $\Theta$ were 4.585 and 2.233, respectively. The
posterior density of $\Theta$ is plotted in Figure 7.105 on page 444 together with the normal
approximation of Theorem 7.101 and an approximation by numerical integration.

The same principle as described here can be used if one wishes to use a 
Cauchy or t distribution for the prior distribution of the location param- 
eter for normally distributed data. In fact, a t distribution for the prior 
combined with data having t distribution can be handled using SSS and 
simple normal/inverse-gamma posteriors. 

The next example is the model described in Section 8.3.2 in which there
are $k$ groups of subjects with $n_i$ subjects in group $i$. The data are $X_i = x_i$,
the number of subjects with a positive response to some query. The $X_i$ are
modeled as conditionally independent $\mathrm{Bin}(n_i, p_i)$ given $P_1 = p_1, \ldots, P_k = p_k$.
The $P_i$ are modeled as conditionally independent $\mathrm{Beta}(\theta r, [1 - \theta]r)$
given $\Theta = \theta$, $R = r$. Finally, we will suppose that $\Theta$ and $R$ are independent
with discrete prior distributions having densities $f_\Theta$ and $f_R$ with respect to
counting measures on sets $\{\theta_1, \ldots, \theta_a\}$ and $\{r_1, \ldots, r_b\}$, respectively. The
conditional posteriors of the $P_i$ given the other parameters were already
seen to be $\mathrm{Beta}(\theta r + x_i, [1 - \theta]r + n_i - x_i)$. The posterior of $\Theta$ given the
others has probability of $\Theta = \theta_j$ proportional to




k 



16 This distribution is also known as $\chi^{-2}_1$.



8.6. Mixtures of Models 



519 



The conditional posterior probability of $R = r_j$ given the other parameters
is proportional to

$$f_R(r_j)\,\Gamma(r_j)^{k}\,\Gamma(\theta r_j)^{-k}\,\Gamma([1-\theta] r_j)^{-k} \prod_{i=1}^{k} p_i^{\theta r_j - 1}(1 - p_i)^{[1-\theta] r_j - 1}.$$

Simulating random variables with a discrete distribution can be done by
the following tedious but straightforward method. Let $X$ have density $f_X$
with respect to counting measure on the set $\{x_1, x_2, \ldots\}$. Generate a $U(0, 1)$
variable $U$. Set $X = x_j$ where $j$ is the first $n$ such that $\sum_{i=1}^{n} f_X(x_i) > U$.
This is the discrete version of the probability integral transform.
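The method just described can be sketched in a few lines. The function name and the optional argument `u` (which lets a caller supply the uniform variate, mainly for checking) are illustrative choices, not part of the text.

```python
import random

def sample_discrete(values, probs, u=None):
    """Return values[j], where j is the first index whose cumulative
    probability exceeds a U(0, 1) variate."""
    if u is None:
        u = random.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if cumulative > u:
            return value
    # Reached only if rounding leaves the total slightly below u.
    return values[-1]

picked = sample_discrete(["a", "b", "c"], [0.2, 0.5, 0.3], u=0.1)
```

In the Beta-binomial example, `values` would be the support points $r_1, \ldots, r_b$ and `probs` the normalized conditional posterior probabilities.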

As an alternative to sampling from discrete distributions, we could
introduce some latent variables that make the problem look like a normal
hierarchical model. 17 Let $X_i = \sum_{j=1}^{n_i} X_{i,j}$, where the $X_{i,j}$ are modeled as
IID $\mathrm{Ber}(p_i)$ given $P_i = p_i$ for each $i$. Let $Z_{i,j}$ be IID with $N(\mu_i, 1)$
distribution given $M_i = \mu_i$ where $P_i = \Phi(M_i)$, and assume that $X_{i,j} = I_{[0,\infty)}(Z_{i,j})$.
We can treat the $Z_{i,j}$ as parameters or missing data. Let the prior for the $M_i$
be that they are conditionally IID with $N(\mu, \tau^2)$ distribution given $M = \mu$
and $T = \tau$. Let $T^2$ have an inverse gamma distribution. Either let $M$ be
independent of T with N(fio^o) distribution or let M given T = r have 
$N(\mu_0, \tau^2/\lambda_0)$ distribution. The conditional distribution of the $Z_{i,j}$ given
the $M_i$, $M$, $T$, and the $X_{i,j}$ is that of independent, truncated $N(M_i, 1)$
random variables. (Those $Z_{i,j}$ corresponding to $X_{i,j} = 1$ are truncated to
the interval $[0, \infty)$ and the others are truncated to the interval $(-\infty, 0)$.)
The conditional distribution of the $M_i$ given the $Z_{i,j}$, the $X_{i,j}$, $M$, and $T$
as well as the conditional distributions of $M$ and $T$ given the others are all
obtained as in the appropriate normal hierarchical model.
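The truncated normal step for the latent $Z_{i,j}$ might be sketched as follows, using the probability integral transform. The function names are hypothetical, and the reflection trick (always sampling in the lower tail) is an implementation choice made here for numerical accuracy, not something prescribed by the text.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()

def _lower_truncated(m, rng):
    """Draw from N(m, 1) truncated to (-inf, 0) by inverting the CDF."""
    p = 0.5 * math.erfc(m / math.sqrt(2.0))   # Pr(N(m, 1) < 0)
    u = 1.0 - rng.random()                    # uniform on (0, 1]
    q = min(max(u * p, 1e-300), 1.0 - 1e-16)  # keep q strictly inside (0, 1)
    return min(m + _STD.inv_cdf(q), 0.0)

def sample_latent_z(mu, x, rng=random):
    """Draw Z ~ N(mu, 1) truncated to [0, inf) if x == 1, else to (-inf, 0)."""
    if x == 1:
        # Reflect: -Z is N(-mu, 1) truncated to (-inf, 0].
        return -_lower_truncated(-mu, rng)
    return _lower_truncated(mu, rng)

random.seed(3)
z_pos = [sample_latent_z(0.4, 1) for _ in range(200)]
z_neg = [sample_latent_z(0.4, 0) for _ in range(200)]
```

Once the $Z_{i,j}$ are drawn, the remaining updates treat them as ordinary normal observations with mean $M_i$ and variance 1.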

8.6 Mixtures of Models 

8.6.1 General Mixture Models 

A different type of hierarchical model is one in which one contemplates
several different models for the same data but does not wish to condition
on just one of them. For example, consider a case in which one observes pairs
$(X_i, Y_i)$ and one wishes to predict the $Y$ coordinate from the $X$ coordinate
(often called regression). One typical model is that there are parameters
$\Theta = (B_0, B_1, \Sigma)$ such that conditional on $\Theta = (\beta_0, \beta_1, \sigma)$ and $X = x$,
$Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$ and $X$ is independent of $\Theta$. Another model says that there
are parameters $\Theta = (B_0, B_1, \Sigma)$ such that conditional on $\Theta = (\beta_0, \beta_1, \sigma)$
and $X = x$, $\log(Y) \sim N(\beta_0 + \beta_1 x, \sigma^2)$ and $X$ is independent of $\Theta$. The two



17 This model and some generalizations of it are discussed by Albert and Chib 
(1993). 






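$\Theta$s are not the same random quantities. In fact, it is common to believe
that at most one of them actually exists. Let $\Psi$ be a random quantity
such that conditional on $\Psi = 0$, there is a $\Theta_0$ such that conditional on
$\Theta_0 = (\beta_0, \beta_1, \sigma)$ and $X = x$, $\log(Y) \sim N(\beta_0 + \beta_1 x, \sigma^2)$, and conditional on
$\Psi = 1$, there is a $\Theta_1$ such that conditional on $\Theta_1 = (\beta_0, \beta_1, \sigma)$ and $X = x$,
$Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$. In a sense, the parameters are now $(\Psi, \Theta_0, \Theta_1)$, but
the joint distribution of $\Theta_0$ and $\Theta_1$ is of no interest since there are no
data that depend on both of them. In fact, we don't even need to believe
that they coexist. One would need to specify a prior distribution for $\Psi$, a
conditional prior for $\Theta_0$ given $\Psi = 0$, and a conditional prior for $\Theta_1$ given
$\Psi = 1$. After observing data, one could calculate conditional posteriors for
$\Theta_0$ and $\Theta_1$ given $\Psi = 0$ and $\Psi = 1$, respectively. One could also construct
prior predictive distributions for the data given $\Psi$ alone and use these to get
the posterior for $\Psi$. In symbols, we need $f_\Psi$, $f_{\Theta_0|\Psi}(\theta_0|0)$, and $f_{\Theta_1|\Psi}(\theta_1|1)$.
The original models give $f_{Y,X|\Theta_0}$ and $f_{Y,X|\Theta_1}$, where we assume that

$$f_{Y,X|\Theta_0}(y,x|\theta_0) = f_{Y,X|\Theta_0,\Psi}(y,x|\theta_0,0) = f_{Y,X|\Theta_0,\Theta_1,\Psi}(y,x|\theta_0,\theta_1,0),$$

$$f_{Y,X|\Theta_1}(y,x|\theta_1) = f_{Y,X|\Theta_1,\Psi}(y,x|\theta_1,1) = f_{Y,X|\Theta_0,\Theta_1,\Psi}(y,x|\theta_0,\theta_1,1),$$

so that $(Y, X)$ is conditionally independent of $\Theta_{1-i}$ given $\Theta_i$ and $\Psi = i$ for
$i = 0, 1$. The predictive density of $(Y, X)$ given $\Psi$ is

$$f_{Y,X|\Psi}(y,x|\psi) = \int_{\Omega_\psi} f_{Y,X|\Theta_\psi}(y,x|\theta_\psi)\, f_{\Theta_\psi|\Psi}(\theta_\psi|\psi)\, d\theta_\psi,$$

where $\Omega_\psi$ is the parameter space given $\Psi = \psi$, for $\psi = 0, 1$. The conditional
posteriors are

$$f_{\Theta_\psi|X,Y,\Psi}(\theta_\psi|x,y,\psi) = \frac{f_{Y,X|\Theta_\psi}(y,x|\theta_\psi)\, f_{\Theta_\psi|\Psi}(\theta_\psi|\psi)}{f_{Y,X|\Psi}(y,x|\psi)},$$

for $\psi = 0, 1$. The posterior of $\Psi$ is

$$f_{\Psi|X,Y}(\psi|x,y) = \frac{f_{X,Y|\Psi}(x,y|\psi)\, f_\Psi(\psi)}{f_{X,Y|\Psi}(x,y|0)\, f_\Psi(0) + f_{X,Y|\Psi}(x,y|1)\, f_\Psi(1)}.$$

If there are future data $(Y', X')$ that are conditionally independent of
$(Y, X)$ given the parameters, then predictive inference is available:

$$f_{Y',X'|Y,X}(y',x'|y,x) = \sum_{\psi=0}^{1} f_{\Psi|X,Y}(\psi|x,y)\, f_{Y',X'|Y,X,\Psi}(y',x'|y,x,\psi)$$
$$= \sum_{\psi=0}^{1} f_{\Psi|X,Y}(\psi|x,y) \int_{\Omega_\psi} f_{Y',X'|\Theta_\psi}(y',x'|\theta_\psi)\, f_{\Theta_\psi|X,Y,\Psi}(\theta_\psi|x,y,\psi)\, d\theta_\psi.$$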






Notice that the predictive density $f_{Y',X'|Y,X}$ is a weighted average of the
two predictive densities one would have used if one had believed each of the
two models. The weights are the posterior probabilities of the two models.
If, for example, model 0 looks orders of magnitude better than model 1
based on the $(Y, X)$ data (that is, $f_{\Psi|X,Y}(0|x,y)$ is much much larger than
$f_{\Psi|X,Y}(1|x,y)$), then the predictive distribution of the future data will be
almost the same as if only model 0 had been used from the start. 18 The
real advantage to this approach arises when neither model appears much
better than the other based on the data. In this case, we can hedge our
predictions to allow for the possibility that one or the other model will later
turn out to appear better.
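The weighting can be illustrated with a short sketch. It assumes that the prior predictive density of the observed data under each model has already been computed (on the log scale, to avoid underflow); the function names and the numerical inputs are hypothetical.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    """Posterior probabilities of the models from log prior predictive
    densities and prior model probabilities."""
    log_w = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    top = max(log_w)                      # subtract the max for stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mixture_predictive(model_predictive_values, post_probs):
    """Weighted average of the per-model predictive densities at one point."""
    return sum(p * f for p, f in zip(post_probs, model_predictive_values))

# Hypothetical inputs: model 0 fits better by two log units.
probs = posterior_model_probs([-10.0, -12.0], [0.5, 0.5])
blended = mixture_predictive([0.30, 0.10], probs)
```

When one log marginal exceeds the other by many units, the corresponding weight is essentially 1 and the blended prediction reduces to that model's prediction.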

Of course, the above description can be extended to apply to arbitrary
data $X$ and an arbitrary number of models. For example, the two models
in the regression example considered above can be embedded in a family
of models in which, conditional on $\Psi = \psi$, $\Theta = (\beta_0, \beta_1, \sigma)$, and $X = x$,

$$\frac{Y^\psi - 1}{\psi} \sim N(\beta_0 + \beta_1 x, \sigma^2),$$

where $\psi = 0$ is defined by continuity (taking a limit). This is the familiar
Box-Cox family of transformations introduced by Box and Cox (1964). If
uncountably many values of $\psi$ are being considered, the sums over $\psi$ must
be replaced by integrals that, presumably, must be evaluated numerically.
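As a small illustration of the family, the transformation can be coded with the $\psi = 0$ case handled by its limit, $\log y$. The function name and the tolerance are illustrative choices.

```python
import math

def box_cox(y, psi, tol=1e-12):
    """Box-Cox transform (y**psi - 1) / psi, with log(y) at psi = 0."""
    if y <= 0:
        raise ValueError("the Box-Cox transform needs y > 0")
    if abs(psi) < tol:
        return math.log(y)        # limit of (y**psi - 1) / psi as psi -> 0
    return (y ** psi - 1.0) / psi

linear_case = box_cox(5.0, 1.0)   # psi = 1 gives y - 1
log_case = box_cox(math.e, 0.0)   # psi = 0 gives log(y)
```

With $\psi = 1$ the transformed response is $Y - 1$, so the model is the untransformed regression up to a shift of the intercept; with $\psi = 0$ it is the regression on $\log(Y)$.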

8.6.2 Outliers 

One popular use of mixtures of models is to allow for the possibility of 
outliers in a data set. An outlier is an observation whose distribution (before 
seeing the data) is not like that of the other observations. This "definition" 
of outlier is intentionally vague. Consider an example to help to clarify the 
concept. 

Example 8.52. Suppose that $X_1, \ldots, X_n$ are potential observations, but we
believe that some of them may not have the same distributions as the others. Box
and Tiao (1968) describe a model similar to the following. Let $\Theta = (M, \Sigma)$ and
suppose that the conditional distribution of each $X_i$ given $\Theta = (\mu, \sigma)$ is $N(\mu, \sigma^2)$
with probability $1 - \alpha$ and is $N(\mu, c\sigma^2)$ with probability $\alpha$, where $c > 1$ and $\alpha$ are
constants chosen a priori. Suppose that the conditional distribution of $M$ given
$\Sigma = \sigma$ is $N(\mu_0, \sigma^2/\lambda_0)$ and $\Sigma^2$ has $\Gamma^{-1}(a_0/2, b_0/2)$ distribution. There is missing
data here, namely the indicators of whether each observation has variance $c\sigma^2$ or
not. Let $\Psi$ stand for the subset of $\{1, \ldots, n\}$ such that the observations with
subscripts in $\Psi$ have larger variance. For each possible value $\psi$ of $\Psi$, let $n_\psi$ indicate



18 The predictive density has been used by many authors as a means for select- 
ing and comparing models. See Geisser and Eddy (1979) and Dawid (1984) for 
different perspectives. 






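the number of elements of $\psi$. Let

$$z_i = \begin{cases} 1 & \text{if } i \notin \psi, \\ 1/c & \text{if } i \in \psi. \end{cases}$$

Then $f_{X|\Psi}(x|\psi)$ equals a constant times

$$c^{-n_\psi/2}\, \lambda_1^{-1/2} \left[ b_0 + \sum_{i=1}^{n} z_i (x_i - \bar{x}_\psi)^2 + \frac{\lambda_0 \sum_{i=1}^{n} z_i}{\lambda_1} (\bar{x}_\psi - \mu_0)^2 \right]^{-(a_0+n)/2}, \qquad (8.53)$$

where

$$\lambda_1 = \lambda_0 + \sum_{i=1}^{n} z_i, \qquad \bar{x}_\psi = \frac{\sum_{i=1}^{n} z_i x_i}{\sum_{i=1}^{n} z_i}.$$

If we model the $Z_i$ as independent, then the prior probability of $\Psi = \psi$ is
$\alpha^{n_\psi}(1 - \alpha)^{n - n_\psi}$, so we can calculate the posterior probability of each possible subset of
outliers.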

For example, suppose that $n = 15$ and our prior has $a_0 = 1$, $b_0 = 100$, $\mu_0 = 0$,
$\lambda_0 = 1$, $\alpha = 0.02$, and $c = 25$. (For computational simplicity, we truncate the
distribution of $\Psi$ to a maximum of six elements.) The data are the infamous
Darwin data [see Fisher (1966), p. 37] shown below

-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75

The possible values of $\psi$ with the highest posterior probability are given in
Table 8.54 under the column "Model 1." The set of size three with the highest
posterior probability of being outliers is {1,2,15} with probability 0.0030. We
can also add the probabilities of all subsets that contain a specific observation to
get the marginal probabilities that each observation is an outlier. See Table 8.55.
The observations not listed in Table 8.55 each have probability less than 0.005
of being an outlier.
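The enumeration behind such a table can be sketched as follows. This is an illustration under stated assumptions, not the author's program: it takes the marginal density of the data given the outlier set to be proportional to $c^{-n_\psi/2}\lambda_1^{-1/2}[b_0 + \sum z_i(x_i - \bar{x}_\psi)^2 + (\lambda_0 \sum z_i/\lambda_1)(\bar{x}_\psi - \mu_0)^2]^{-(a_0+n)/2}$ with $z_i = 1/c$ for outliers and $1$ otherwise, and it indexes observations from 0 rather than 1. The function names are hypothetical.

```python
import math
from itertools import combinations

DARWIN = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75]

def log_marginal(x, subset, a0, b0, mu0, lam0, c):
    """Log of the (unnormalized) density of x given the outlier set."""
    z = [1.0 / c if i in subset else 1.0 for i in range(len(x))]
    zsum = sum(z)
    xbar = sum(zi * xi for zi, xi in zip(z, x)) / zsum
    lam1 = lam0 + zsum
    quad = (b0 + sum(zi * (xi - xbar) ** 2 for zi, xi in zip(z, x))
            + lam0 * zsum / lam1 * (xbar - mu0) ** 2)
    return (-0.5 * len(subset) * math.log(c) - 0.5 * math.log(lam1)
            - 0.5 * (a0 + len(x)) * math.log(quad))

def outlier_set_posterior(x, a0=1.0, b0=100.0, mu0=0.0, lam0=1.0,
                          alpha=0.02, c=25.0, max_size=6):
    """Posterior probability of every outlier set with at most max_size elements."""
    n = len(x)
    log_post = {}
    for size in range(max_size + 1):
        log_prior = size * math.log(alpha) + (n - size) * math.log(1.0 - alpha)
        for subset in combinations(range(n), size):
            log_post[subset] = log_prior + log_marginal(
                x, frozenset(subset), a0, b0, mu0, lam0, c)
    top = max(log_post.values())
    weights = {s: math.exp(v - top) for s, v in log_post.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

post = outlier_set_posterior(DARWIN)
# Marginal probability that the first observation (-67) is an outlier.
prob_first = sum(p for s, p in post.items() if 0 in s)
```

Summing the probabilities of all sets containing a given index gives the marginal outlier probability for that observation, which is how a table of per-observation probabilities would be assembled.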

Of course, one need not choose a single value of $c$ or a single value of $\alpha$. One
could treat these as further mixing parameters like $\Psi$ and compute a posterior



Table 8.54. Posterior Probabilities of Outlier Sets

Set        Model 1    Model 2
∅          0.7192     0.7767
{1}        0.1295     0.0885
{1,2}      0.0483     0.0448
{2}        0.0241     0.0166
{15}       0.0114     0.0080
{14}       0.0060     0.0042
{13}       0.0052     0.0037
{12}       0.0043     0.0031
{11}       0.0036     0.0026
{3}        0.0033     0.0023
{1,15}     0.0032     0.0031
{4}        0.0032     0.0023





Table 8.55. Posterior Probabilities of Outlier Observations

i           1        2        11       12       13       14       15
Model 1   0.1970   0.0816   0.0050   0.0060   0.0075   0.0088   0.0195
Model 2   0.1627   0.0803   0.0047   0.0057   0.0074   0.0089   0.0214



distribution for them. For example, (8.53) is now $f_{X|\Psi,C,A}(x|\psi,c,\alpha)$, which does
not depend on $\alpha$. This is because $X$ is conditionally independent of $A$ given $\Psi$
and $C$. Suppose that we let $A$ have a Beta distribution. It is difficult to have a
Beta distribution with mean 0.02 which is neither extremely concentrated near
its mean nor extremely concentrated near 0. Suppose that we choose Beta(1, 49),
which has Pr(A < 0.02) = 0.6358. Also, let $C$ have probability 0.05 of being each
of the numbers 5, 10, ..., 100. The posterior distribution of $C$ is almost the same
as the prior, meaning that the different values of $C$ do not lead to much difference
in the predictive density of the data, although $C = 10$ has the highest posterior
probability. The posterior distribution of $A$ has mean 0.0204. The posterior
probabilities of the various $\psi$ sets are given in Table 8.54 under the column "Model 2."
The probability of the set {1,2,15} is now 0.0058. The probabilities that each
of the observations is an outlier are in Table 8.55. Although the probability that
$\Psi = \{1\}$ is smaller in Model 2, the probability that observation 1 is an outlier is
still quite large, because there are many other sets $\psi$ containing 1 which now have
higher probability of equaling $\Psi$. For example, the probability of three outliers
is twice as high in Model 2 as in Model 1, and the probability of four outliers is
six times as high.

Before we leave this example, we offer another variation. Suppose that we
give $C$ a continuous prior distribution, say $\Gamma^{-1}(c_0/2, d_0/2)$ truncated below at
$c = 1$ and independent of $(\Sigma, M, A)$. We could find the posterior distributions of
whatever we wanted by using successive substitution sampling (see Section 8.5).
The following conditional posterior distributions are easy to find and are easy to
simulate:



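$$\Sigma^2 \sim \Gamma^{-1}\left( \frac{a_0 + n + 1}{2},\ \frac{1}{2}\left[ b_0 + \lambda_0(\mu - \mu_0)^2 + \sum_{i \notin \psi}(x_i - \mu)^2 + \frac{1}{c}\sum_{i \in \psi}(x_i - \mu)^2 \right] \right),$$

$$C \sim \Gamma^{-1}\left( \frac{c_0 + n_\psi}{2},\ \frac{d_0}{2} + \frac{1}{2\sigma^2}\sum_{i \in \psi}(x_i - \mu)^2 \right),$$

$$M \sim N\left( \frac{\lambda_0\mu_0 + \sum_{i \notin \psi} x_i + \frac{1}{c}\sum_{i \in \psi} x_i}{\lambda_0 + n - n_\psi + n_\psi/c},\ \frac{\sigma^2}{\lambda_0 + n - n_\psi + n_\psi/c} \right),$$

$$A \sim \mathrm{Beta}(\alpha_0 + n_\psi,\ \beta_0 + n - n_\psi),$$

$$Z_i \sim \mathrm{Ber}\left( \frac{\alpha}{\alpha + \sqrt{c}\,(1 - \alpha)\exp\left( -\frac{c-1}{2c\sigma^2}(x_i - \mu)^2 \right)} \right),$$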

where, as before, $\Psi = \{i : Z_i = 1\}$, and the distribution of $C$ is still truncated
below at $c = 1$. Since the inverse of the $\Gamma$ CDF is available in many subroutine
libraries, the truncated distribution can be simulated using the probability
integral transform. Notice that the $Z_i$ are independent of each other given the other
parameters and the data. An analysis like the one just described is developed by
Verdinelli and Wasserman (1991).
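Two of these updates, the outlier indicators and the mixing probability $A$, can be sketched as follows. The Bernoulli probability used is $\alpha / [\alpha + \sqrt{c}(1-\alpha)\exp(-(c-1)(x_i-\mu)^2/(2c\sigma^2))]$, and the update for $A$ assumes a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior; the function names and the numerical inputs are hypothetical.

```python
import math
import random

def outlier_probability(x_i, mu, sigma2, c, alpha):
    """Conditional probability that observation x_i has the inflated variance."""
    expo = -(c - 1.0) * (x_i - mu) ** 2 / (2.0 * c * sigma2)
    return alpha / (alpha + math.sqrt(c) * (1.0 - alpha) * math.exp(expo))

def draw_indicators_and_alpha(x, mu, sigma2, c, alpha, alpha0, beta0, rng=random):
    """One SSS step for (Z_1, ..., Z_n) followed by A."""
    z = [1 if rng.random() < outlier_probability(xi, mu, sigma2, c, alpha) else 0
         for xi in x]
    n_out = sum(z)
    new_alpha = rng.betavariate(alpha0 + n_out, beta0 + len(x) - n_out)
    return z, new_alpha

random.seed(4)
p_center = outlier_probability(0.0, 0.0, 1.0, 25.0, 0.02)
z_draw, a_draw = draw_indicators_and_alpha(
    [-3.1, 0.2, 0.5, 9.0], mu=0.0, sigma2=1.0, c=25.0,
    alpha=0.02, alpha0=1.0, beta0=49.0)
```

Because the indicators are conditionally independent, they can be drawn in a single pass, as the list comprehension does.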

Notice that in the last variation, the model no longer resembles a mixture of
models. In fact, it is just a more highly parameterized model with parameter
$(\Sigma, M, \Psi, C, A)$. Alternatively, the parameter could be taken as $(\Sigma, M, C, A)$ with
$Z$ being considered as missing data.

This example is not meant to be a prescription for how to handle out- 
liers, but merely an example of how mixtures of models can be used for 
such a problem. Freeman (1980) describes several other methods for detect- 
ing outliers. West (1984) describes hierarchical models for accommodating 
outliers in linear regression. 

In fact, any situation in which there is uncertainty is amenable to analysis 
using a mixture of models. Even the simplest univariate one-sample prob- 
lem can admit several prior distributions and/or parametric families. The 
different combinations of parametric family and prior distribution can be 
mixed using the general theory outlined above. See Problem 14 on page 75 
for a simple example. 

8.6.3 Bayesian Robustness 

In Section 5.1.5, we introduced M-estimators as robust estimators that 
might be less sensitive to anomalies in the data. Because Bayesian solu- 
tions depend on prior distributions for parameters in addition to the con- 
ditional distributions of data given parameters, one might be interested in 
prior distributions that provide a measure of robustness. Also, one might 
be interested in the degree of robustness that a particular choice of prior 
exhibits when compared with several others. 

A straightforward way of comparing a particular prior distribution $\mu_\Theta$ to
several others is to compute whatever one would normally compute using
$\mu_\Theta$ as the prior and then recompute the same quantities using all of the
other priors. This activity often goes by the name of sensitivity analysis. If
the number of alternative priors is too large, one might be able to compute
bounds on the various quantities of interest as the prior ranges over the
alternatives. One popular way of specifying a set of alternative priors is by
means of $\epsilon$-contamination. For a given $\mu_\Theta$, $\epsilon > 0$, and set $\mathcal{C}$ of probability
measures on $(\Omega, \tau)$, one forms the collection

$$\mathcal{C}_\epsilon = \{(1 - \epsilon)\mu_\Theta + \epsilon\eta : \eta \in \mathcal{C}\} \qquad (8.56)$$

of alternative prior distributions. The set $\mathcal{C}_\epsilon$ is called an $\epsilon$-contamination
class. If $\mu_\Theta \in \mathcal{C}$, then $\mu_\Theta \in \mathcal{C}_\epsilon$ also. Note that each element of $\mathcal{C}_\epsilon$ is a
mixture of two possible prior distributions. The largest set $\mathcal{C}$ one could
use is the set of all probability distributions on $(\Omega, \tau)$. Suppose that one is
interested in posterior probabilities of sets $C \in \tau$. It is possible to calculate






bounds on the posterior probabilities of such sets as the prior ranges over
$\mathcal{C}_\epsilon$.

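Theorem 8.57. 19 Suppose that $X$ has conditional density $f_{X|\Theta}(x|\theta)$ with
respect to $\nu$ given $\Theta = \theta$. Let $\mathcal{C}$ be the set of all distributions on $(\Omega, \tau)$,
and let $\mathcal{C}_\epsilon$ be as in (8.56). For each $\pi \in \mathcal{C}_\epsilon$, let $\pi(\cdot|x)$ denote the posterior
distribution of $\Theta$ given $X = x$ calculated as if $\pi$ were the prior distribution.
Similarly, let $\mu_{\Theta|X}(\cdot|x)$ denote the posterior calculated as if $\mu_\Theta$ were the
prior. For each $C \in \tau$,

$$\inf_{\pi \in \mathcal{C}_\epsilon} \pi(C|x) = \frac{(1 - \epsilon) f_X(x)\, \mu_{\Theta|X}(C|x)}{(1 - \epsilon) f_X(x) + \epsilon \sup_{\theta \in C^c} f_{X|\Theta}(x|\theta)},$$

$$\sup_{\pi \in \mathcal{C}_\epsilon} \pi(C|x) = 1 - \frac{(1 - \epsilon) f_X(x)\,[1 - \mu_{\Theta|X}(C|x)]}{(1 - \epsilon) f_X(x) + \epsilon \sup_{\theta \in C} f_{X|\Theta}(x|\theta)},$$

where $f_X$ denotes the marginal density of $X$ under the assumption that $\mu_\Theta$
is the prior.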

Proof. For $\pi \in \mathcal{C}_\epsilon$ with $\pi = (1 - \epsilon)\mu_\Theta + \epsilon\eta$, it is easy to see that

$$\pi(C|x) = \frac{(1 - \epsilon) f_X(x)\, \mu_{\Theta|X}(C|x) + \epsilon \int_C f_{X|\Theta}(x|\theta)\, d\eta(\theta)}{(1 - \epsilon) f_X(x) + \epsilon g(x)}, \qquad (8.58)$$

where $g(x) = \int f_{X|\Theta}(x|\theta)\, d\eta(\theta)$ is the marginal density of $X$ under the
assumption that $\eta$ is the prior. The expression in (8.58) will get smaller if $\eta$
is replaced by any $\eta^*$ such that $\eta^*(C) = 0$ and $\eta^*(D) \geq \eta(D)$ for $D \subseteq C^c$.
(For example, rescale $\eta(\cdot \cap C^c)$ to be a probability.) It follows that the
smallest values of (8.58) occur when $\eta(C) = 0$. When $\eta(C) = 0$, (8.58)
becomes

$$\frac{(1 - \epsilon) f_X(x)\, \mu_{\Theta|X}(C|x)}{(1 - \epsilon) f_X(x) + \epsilon g(x)}. \qquad (8.59)$$

This can be minimized by making $g(x)$ as large as possible. But since
$g(x)$ is an average of values of $f_{X|\Theta}(x|\theta)$, its supremum is clearly equal
to $\sup_{\theta \in C^c} f_{X|\Theta}(x|\theta)$. So the infimum of $\pi(C|x)$ equals (8.59) with $g(x)$
replaced by $\sup_{\theta \in C^c} f_{X|\Theta}(x|\theta)$. The supremum is obtained by applying
the same argument to $C^c$.



□ 



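Example 8.60. Let $X \sim \mathrm{Exp}(\theta)$ given $\Theta = \theta$, and let $\mu_\Theta$ be the $\Gamma(a, b)$
distribution. Let $C$ be the interval $[0, c(x)]$, where $c(x)$ is the $\gamma$ quantile of the posterior
distribution of $\Theta$. In this case, the posterior is $\Gamma(a + 1, b + x)$ and the marginal
density of $X$ is $f_X(x) = ab^a/(b + x)^{a+1}$. The likelihood function is $f_{X|\Theta}(x|\theta) = \theta\exp(-x\theta)$,
which increases for $\theta < 1/x$ and decreases thereafter. So, we have

$$\sup_{\theta \in C} f_{X|\Theta}(x|\theta) = \begin{cases} \exp(-1)/x & \text{if } 1/x \leq c(x), \\ c(x)\exp(-c(x)x) & \text{if } 1/x > c(x), \end{cases}$$
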

19 This theorem appears in Berger (1985).






$$\sup_{\theta \in C^c} f_{X|\Theta}(x|\theta) = \begin{cases} \exp(-1)/x & \text{if } 1/x > c(x), \\ c(x)\exp(-c(x)x) & \text{if } 1/x \leq c(x). \end{cases}$$

The value of $\mu_{\Theta|X}(C|x) = \gamma$ by design. The bounds given by Theorem 8.57 for
the $\epsilon$-contamination class using $\mathcal{C}$ equal to all distributions are, for $1/x \leq c(x)$,

$$\frac{(1 - \epsilon)\gamma \frac{ab^a}{(b+x)^{a+1}}}{(1 - \epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon c(x)\exp(-c(x)x)} \leq \pi(C|x) \leq 1 - \frac{(1 - \epsilon)(1 - \gamma)\frac{ab^a}{(b+x)^{a+1}}}{(1 - \epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon\exp(-1)/x}.$$

For example, with $\gamma = 0.5$, $a = b = 1$, and $\epsilon = 0.1$, we have $c(x) = 1.678/(1 + x)$,
which is greater than or equal to $1/x$ for $x \geq 1.474$. Figure 8.61 shows a
plot of the lower and upper bounds on the posterior probabilities of the interval
$[0, 1.678/(1 + x)]$ as a function of $x$. Notice how the degree of robustness depends
on the observed data. When $x$ is very small, the likelihood function is quite large
for large values of $\theta$ outside of the interval $(0, c(x)]$, since $c(x)$ never gets bigger
than 1.678. A prior that assigned probability 1 to such a large $\theta$ value would be
consistent with a very small observed $x$ and would give low probability to every
subinterval of $[0, 1.678]$. If such priors seem unreasonable, then perhaps the class
$\mathcal{C}_\epsilon$ is too large.
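The bounds in this example can be evaluated directly, as in the following sketch. It handles both cases ($1/x \leq c(x)$ and $1/x > c(x)$) by picking the appropriate supremum of the likelihood over $C$ and over its complement. The function name is hypothetical, and the default `q = 1.678` is the median of the $\Gamma(2, 1)$ distribution quoted in the example, so the defaults are only valid for $a = 1$ and $\gamma = 0.5$.

```python
import math

def contamination_bounds(x, eps=0.1, a=1.0, b=1.0, gamma=0.5, q=1.678):
    """Lower and upper bounds on the posterior probability of [0, c(x)].

    q must be the gamma-quantile of the Gamma(a + 1, 1) distribution;
    the default matches a = 1, gamma = 0.5 as in the example.
    """
    c = q / (b + x)                           # quantile of Gamma(a+1, b+x)
    f_marg = a * b ** a / (b + x) ** (a + 1)  # marginal density of X
    at_mode = math.exp(-1.0) / x              # likelihood at theta = 1/x
    at_c = c * math.exp(-c * x)               # likelihood at theta = c(x)
    mode_inside = (1.0 / x) <= c
    sup_in = at_mode if mode_inside else at_c
    sup_out = at_c if mode_inside else at_mode
    base = (1.0 - eps) * f_marg
    lower = base * gamma / (base + eps * sup_out)
    upper = 1.0 - base * (1.0 - gamma) / (base + eps * sup_in)
    return lower, upper

bounds_small_x = contamination_bounds(0.1)
bounds_large_x = contamination_bounds(5.0)
```

Evaluating the function over a grid of $x$ values traces out the two curves of lower and upper bounds as functions of the observation.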

Additionally, we may wish to find bounds for the posterior mean of a
measurable function $g$ of $\Theta$ as the prior distribution varies over a class
such as an $\epsilon$-contamination class. The following theorem, which is helpful
in this regard, is due to Lavine, Wasserman, and Wolpert (1991, 1993).




Figure 8.61. Lower and Upper Bounds on Posterior Probabilities 






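Theorem 8.62. Let $\Gamma$ be a class of prior distributions on $(\Omega, \tau)$, and let
$g : \Omega \to \mathbb{R}$ be a measurable function. Suppose that $\inf_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\, d\pi(\theta) > 0$.
For each $\pi \in \Gamma$, define $s_\pi(\lambda) = \int f_{X|\Theta}(x|\theta)[g(\theta) - \lambda]\, d\pi(\theta)$, and let

$$s(\lambda) = \sup_{\pi \in \Gamma} s_\pi(\lambda).$$

Then for finite $\lambda$, the least upper bound on the posterior means of $g(\Theta)$ is
$\lambda$ if and only if $s(\lambda) = 0$.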

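Proof. Let

$$\lambda_0 = \sup_{\pi \in \Gamma} \frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi(\theta)},$$

and assume that $\lambda_0$ is finite. For the "if" direction, suppose that $s(\lambda) = 0$.
We need to prove that $\lambda = \lambda_0$. Since $s(\lambda) = 0$, we know that $s_\pi(\lambda) \leq 0$
for all $\pi \in \Gamma$ and that there exists a sequence $\{\pi_n\}_{n=1}^{\infty}$ of elements of $\Gamma$
such that, for each $n$, $s_{\pi_n}(\lambda) > -1/n$. This last claim can be written as
$\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi_n(\theta) > \lambda \int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta) - 1/n$, which implies

$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)} > \lambda - \frac{1}{n \int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)},$$

for all $n$. We know that

$$\lambda_0 \geq \sup_n \frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)} \geq \sup_n \left[ \lambda - \frac{1}{n \int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)} \right]. \qquad (8.63)$$

Because $\inf_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\, d\pi(\theta) > 0$, the far right-hand side of (8.63)
equals $\lambda$, so $\lambda_0 \geq \lambda$. We can rewrite $s_\pi(\lambda) \leq 0$ as $\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi(\theta) \leq
\lambda \int f_{X|\Theta}(x|\theta)\, d\pi(\theta)$, which implies

$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi(\theta)} \leq \lambda.$$

Since this is true for all $\pi \in \Gamma$, it follows that $\lambda_0 \leq \lambda$, and we conclude
$\lambda_0 = \lambda$.

For the "only if" part, we must show that $s(\lambda_0) = 0$. From the fact that

$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi(\theta)} \leq \lambda_0,$$

for all $\pi \in \Gamma$, it easily follows that $s(\lambda_0) \leq 0$. Suppose that $s(\lambda_0) = -\epsilon$ for
some $\epsilon > 0$. We will derive a contradiction. We know that there exists a
sequence $\{\pi_n\}_{n=1}^{\infty}$ of elements of $\Gamma$ such that, for each $n$,

$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)} > \lambda_0 - \frac{1}{n}.$$

Since $s(\lambda_0) = -\epsilon$, it follows that, for every $n$,

$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\, d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)} \leq \lambda_0 - \frac{\epsilon}{\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta)}.$$

These two inequalities imply that $\int f_{X|\Theta}(x|\theta)\, d\pi_n(\theta) > n\epsilon$ for every $n$. This
contradicts $\sup_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\, d\pi(\theta) < \infty$. □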
To use Theorem 8.62 to find bounds on posterior means, we first note that
lower bounds can be obtained by replacing $g$ by $-g$ and finding another
upper bound. For fixed $\lambda$, $s_\pi(\lambda)$ is a linear function of $\pi$. In the case of
$\epsilon$-contamination classes ($\Gamma = \mathcal{C}_\epsilon$), it follows that $s(\lambda)$ is the supremum over
the set of contaminations of the form $\pi(B) = (1 - \epsilon)\mu_\Theta(B) + \epsilon I_B(\theta_0)$ for $\theta_0 \in \Omega$.
For such a $\pi$, the posterior mean of $g(\Theta)$ is

$$\frac{(1 - \epsilon)\int g(\theta) f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta) + \epsilon f_{X|\Theta}(x|\theta_0) g(\theta_0)}{(1 - \epsilon)\int f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta) + \epsilon f_{X|\Theta}(x|\theta_0)}.$$

One can usually find the supremum and infimum of this expression as
a function of $\theta_0$ using standard numerical methods. The two integrals,
$\int g(\theta) f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)$ and $\int f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta)$, are constants in these
numerical problems.

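Example 8.64 (Continuation of Example 8.60; see page 525). We have $X \sim \mathrm{Exp}(\theta)$
given $\Theta = \theta$, and $\mu_\Theta$ is the $\Gamma(a, b)$ distribution. Let $g(\theta) = \theta$. So

$$\int g(\theta) f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta) = \frac{a(a+1)b^a}{(b+x)^{a+2}}, \qquad \int f_{X|\Theta}(x|\theta)\, d\mu_\Theta(\theta) = \frac{ab^a}{(b+x)^{a+1}}.$$

The function for which we need to find extremes is then

$$h(\theta) = \frac{(1 - \epsilon)\frac{a(a+1)b^a}{(b+x)^{a+2}} + \epsilon\theta^2\exp(-x\theta)}{(1 - \epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon\theta\exp(-x\theta)}.$$

If we let $a = b = 1$ and $\epsilon = 0.1$ as before, we can find the extremes of $h(\theta)$ for
every possible $x$. Figure 8.65 shows the lower and upper bounds on $E(\Theta|X = x)$
for $x$ between 0.1 and 10. As $x \to 0$, the upper bound goes to $\infty$ because $x$ close
to 0 is most consistent with very large values for $\theta$. The bounds get very close
together and small as $x \to \infty$ because large $x$ values are most consistent with
very small values of $\theta$.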
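One simple numerical method for finding the extremes is a grid search over the contaminating point $\theta_0$, sketched below. The grid search, the function name, and the truncation of the search at `theta_max` are illustrative assumptions; any one-dimensional optimizer could be substituted. The function minimized and maximized is $h(\theta) = [(1-\epsilon)a(a+1)b^a/(b+x)^{a+2} + \epsilon\theta^2 e^{-x\theta}] / [(1-\epsilon)ab^a/(b+x)^{a+1} + \epsilon\theta e^{-x\theta}]$.

```python
import math

def posterior_mean_bounds(x, eps=0.1, a=1.0, b=1.0,
                          grid_size=20000, theta_max=200.0):
    """Approximate inf and sup of the posterior mean of Theta over the
    epsilon-contamination class by a grid search in theta_0."""
    num0 = (1.0 - eps) * a * (a + 1.0) * b ** a / (b + x) ** (a + 2)
    den0 = (1.0 - eps) * a * b ** a / (b + x) ** (a + 1)

    def h(theta):
        like = theta * math.exp(-x * theta)   # Exp(theta) likelihood at x
        return (num0 + eps * theta * like) / (den0 + eps * like)

    values = [h(theta_max * k / grid_size) for k in range(grid_size + 1)]
    return min(values), max(values)

lower_mean, upper_mean = posterior_mean_bounds(1.0)
```

Since the upper bound grows without limit as $x \to 0$, the value of `theta_max` needs to be increased for very small $x$; the grid answer is only as good as the range and resolution searched.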

In addition to sensitivity analysis, one can try to find prior distributions
such that resulting inferences exhibit some degree of robustness to changes
in the prior. Consider the case in which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ given $\Theta =
(\mu, \sigma)$. The natural conjugate prior is one of the form $M \sim N(\mu_0, \sigma^2/\lambda_0)$
given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$. Such priors have the property that
the posterior mean of $M$ is $\mu_1 = (n\bar{x} + \lambda_0\mu_0)/(n + \lambda_0)$, which has a
component $\lambda_0\mu_0/(n + \lambda_0)$ that remains the same no matter what the data values



Figure 8.65. Lower and Upper Bounds on Posterior Mean of $\Theta$

are. It is sometimes desirable to have the influence of the prior become less
pronounced as the data move away from what would be predicted by the
prior. Alternatives to natural conjugate priors, which are less influential,
are ones in which $M$ is independent of $\Sigma^2$ with $t_{c_0}(\mu_0, \tau_0^2)$ distribution. We
will assume that $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ in this prior also. At first, it may
seem difficult to work with such a prior because the posterior cannot be
written in closed form. However, we can use the following trick to make the
problem more tractable: Invent a random variable $Y$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$
distribution independent of $\Sigma$, and pretend as if $M \sim N(\mu_0, Y)$ given $Y$.
The marginal distribution of $M$ is then $t_{c_0}(\mu_0, \tau_0^2)$ as earlier prescribed. But
now, if we treat $(M, \Sigma, Y)$ as the parameter, we can use successive
substitution sampling (SSS) because the following conditional distributions are
obtained:



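$$M \mid \Sigma = \sigma, Y = y \ \sim\ N(\mu_1(y, \sigma), \tau_1(y, \sigma)),$$

$$\Sigma^2 \mid M = \mu, Y = y \ \sim\ \Gamma^{-1}\left(\frac{a_0 + n}{2}, \frac{b_1(\mu, y)}{2}\right),$$

$$Y \mid M = \mu, \Sigma = \sigma \ \sim\ \Gamma^{-1}\left(\frac{c_0 + 1}{2}, \frac{d_1(\mu, \sigma)}{2}\right),$$

where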



530 Chapter 8. Hierarchical Models 



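$$\mu_1(y, \sigma) = \tau_1(y, \sigma)\left(\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{y}\right), \qquad \tau_1(y, \sigma) = \left(\frac{n}{\sigma^2} + \frac{1}{y}\right)^{-1},$$

$$b_1(\mu, y) = b_0 + \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2, \qquad d_1(\mu, \sigma) = c_0\tau_0^2 + (\mu - \mu_0)^2.$$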

In fact, since the prior density for M will tend to be very flat relative to 
the likelihood function with even a small amount of data, this prior is a lot 
like using an improper prior such as Lebesgue measure. 
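
The successive substitution scheme for the $t$ prior on $M$ can be sketched as follows. This is an illustration under stated assumptions rather than the book's own computation: it uses the standard conditionally conjugate updates for a normal likelihood with $M \sim N(\mu_0, Y)$ given $Y$, $Y \sim \Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$. The function name `sss_t_prior`, the hyperparameter values, the burn-in length and the seed are hypothetical; the data are the ten observations listed in Example 8.66.

```python
import random
import statistics

def sss_t_prior(x, mu0, tau0_sq, c0, a0, b0, n_iter=5000, burn_in=500, seed=0):
    """Draws of M from successive substitution over (M, Sigma^2, Y)."""
    rng = random.Random(seed)
    n = len(x)
    xbar = statistics.fmean(x)
    ss = sum((xi - xbar) ** 2 for xi in x)
    sigma_sq, y = max(ss / n, 1e-8), tau0_sq          # starting values
    draws = []
    for it in range(n_iter):
        # M | Sigma, Y is normal with precision n/sigma^2 + 1/y.
        prec = n / sigma_sq + 1.0 / y
        mean = (n * xbar / sigma_sq + mu0 / y) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Sigma^2 | M is inverse gamma((a0 + n)/2, b1/2); draw a gamma with
        # rate b1/2 (scale 2/b1) and invert it.
        b1 = b0 + ss + n * (xbar - mu) ** 2
        sigma_sq = 1.0 / rng.gammavariate((a0 + n) / 2.0, 2.0 / b1)
        # Y | M is inverse gamma((c0 + 1)/2, d1/2).
        d1 = c0 * tau0_sq + (mu - mu0) ** 2
        y = 1.0 / rng.gammavariate((c0 + 1) / 2.0, 2.0 / d1)
        if it >= burn_in:
            draws.append(mu)
    return draws

if __name__ == "__main__":
    data = [1.66, 1.07, 0.640, 0.310, 0.295, -0.070, -0.107, -1.67, -1.90, -1.97]
    # Illustrative hyperparameters (assumptions, not taken from the text).
    m = sss_t_prior(data, mu0=0.0, tau0_sq=4.0, c0=1, a0=1, b0=1.0)
    print("posterior mean of M:", statistics.fmean(m))
    print("posterior sd of M:  ", statistics.stdev(m))
```

Because the prior on $M$ is heavy-tailed and flat relative to the likelihood, the simulated posterior mean should sit close to the sample average.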

Of course, Bayesians can be interested in the same aspects of robustness
in which classical statisticians are interested, namely robustness with
respect to unexpected observations. In the classical framework, we introduced
M-estimators (see Section 5.1.5) to be less sensitive to extreme
observations. In the Bayesian framework, this would correspond to using
alternative conditional distributions for the data given $\Theta$ to reflect the
opinion that occasional extreme observations might arise. Consider the case in
which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, given $\Theta = (\mu, \sigma)$, most of the time but in
which an observation with higher variance is occasionally observed.
Alternatively, suppose that each observation $X_i$ comes with its own variance
$\Sigma_i^2$ and that the $\Sigma_i^2$ are exchangeable. If the conditional distribution of $\Sigma_i^2$ were
$\Gamma^{-1}(a_0/2, a_0 T/2)$ given $T$, this would be equivalent to saying that the $X_i$
had $t_{a_0}(\mu, \tau^2)$ distribution given $M = \mu$ and $T = \tau$. That is, we would have
changed the likelihood from normal to $t_{a_0}$. Once again, it may seem
difficult to work with such a likelihood because the posterior cannot be written
in closed form. However, we can again use SSS to make the problem more
tractable. Suppose that we model $M$ as $N(\mu_0, Y)$ given $Y, \Sigma_1, \ldots, \Sigma_n, T$,
and $Y$ as independent of the $\Sigma_i$ and $T$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$ distribution,
and $T \sim \Gamma(d_0/2, f_0/2)$. The conditional posterior distributions needed are

M ~ 

n ~ 

Y ~ 
T ~ 

where 

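$$\tau_1(\sigma_1, \ldots, \sigma_n, y) = \left(\frac{1}{y} + \sum_{i=1}^n \frac{1}{\sigma_i^2}\right)^{-1},$$

$$\mu_1(\sigma_1, \ldots, \sigma_n, y) = \left(\frac{\mu_0}{y} + \sum_{i=1}^n \frac{x_i}{\sigma_i^2}\right)\tau_1(\sigma_1, \ldots, \sigma_n, y).$$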
Alternatively, one could integrate numerically. 

Example 8.66. Suppose that we model the data as above with $\mu_0 = 0$, $a_0 = 5$,
$c_0 = 1$, $\tau_0 = 2$, $d_0 = 1$, and $f_0 = 1/2$. This prior has the property that the



N(ni(<?i, ■ ■ ■ , <J n , y), n(o"x, . . . , <r„, y)), 
r ' 2 J' 



_! / Cp + l CqT$ + (n - Hq) 2 \ 
V 2 ' 2 )' 

fdo + n /o + E"=i % \ 






density of the data given the parameters has thinner tails than the prior density 
of M. This is because the degrees of freedom is 5 for the t distribution of the data 
given the parameters, but the degrees of freedom is only 1 for the t distribution 
of M. This will allow the posterior to resemble the likelihood to a large extent. 
Consider the following 10 observations: 

1.66, 1.07, 0.640, 0.310, 0.295, -0.070, -0.107, -1.67, -1.90, -1.97. 

The posterior mean of $M$ is $-0.077$, and the posterior standard deviation of $M$ is
0.4202. (The sample average is $-0.173$, and the sample standard deviation over
$\sqrt{10}$ is 0.4017.) We could now consider what changes if one of the observations
moves off to $\infty$. For example, suppose that we take the smallest observation,
$-1.97$, and let it move to 0 and then to $+\infty$. Figure 8.67 shows a plot of the
posterior mean of M as a function of the moving observation. Notice that the 
mean of M increases almost linearly with the moving observation for some time 
and then begins to decrease again. The decrease is due to the moving observation's 
having reached a level beyond which it is more likely to be coming from the tail 
of the distribution than from a large value of M. 

The other curves in Figure 8.67 correspond to models with different degrees of
freedom. All four of the cases with $a_0, c_0 \in \{1, 5\}$ are illustrated. The value of $c_0$
does not have nearly as much influence as the value of $a_0$. When $a_0$ changes to 1
(with $c_0 = 1$ and the original data), the posterior mean and standard deviation
of $M$ become 0.1258 and 0.1388, respectively. (The MLE of $M$ would be 0.2064,
but the likelihood is not very peaked.) In this case, as one observation changes, 
the posterior mean of M is affected most by the average of those observations in 
the middle of the data set. When the moving observation enters the middle of 
the data set, the posterior mean of M varies linearly with the observation. But 
when it moves out of the middle, the posterior mean moves back down again. 

Of course, one straightforward way to develop robust models is to form 
mixtures of all sensible models for the data. Those models with high prior 




Figure 8.67. Posterior Mean of M as a Function of One Observation (horizontal axis: moving observation)



Table 8.68. Relative Values of Prior Predictive Density

                  Value of the last observation
  $a_0$        -1.970     0.531     3.030     8.030
  1             0.252     0.918     0.367     1.501
  2             0.553     1.048     0.663     1.412
  5             1.000     1.000     1.000     1.000
  10            1.187     1.001     1.091     0.945
  20            1.283     1.003     1.110     0.623
  60            1.348     1.003     1.105     0.221
  $\infty$      1.381     1.003     1.097     0.096
  average       1.001     0.997     0.911     0.828


predictive density will surface as the ones that contribute most to the pos- 
terior predictive distribution of future data. 

Example 8.69 (Continuation of Example 8.66; see page 530). Suppose that we
are not sure which degrees of freedom to use for the conditional distribution of $X_i$
given $(M, T)$. We might try a mixture of models with model $i$ having $a_0 = i$ for
$i$ in some set where $a_0 = \infty$ means that the conditional distribution is $N(M, T^2)$
rather than $t_{a_0}(M, T^2)$. Table 8.68 lists the relative values of the prior predictive
densities of the data for a few values of $a_0$ with $a_0 = 5$ taken as 1.0. Four
different data sets are used; they differ only in the value of the last observation,
which is listed in the column heading. The $t_5$ distribution is relatively robust
as one observation increases, and the equal mixture of the seven models (the
row labeled "average") has predictive density remarkably close to that of the $t_5$
model. One might argue that the equal mixture of the seven other models is not
itself sensible. Putting 3/7 of the probability on $\{20, 60, \infty\}$ degrees of freedom
is saying that one is somewhat confident that the data will be approximately
normal. On the other hand, putting 3/7 of the probability on $\{1, 2, 5\}$ degrees of
freedom is saying that one is equally confident that the data will likely have an
occasional "outlier."
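
To make the mixture idea concrete, the sketch below converts one column of Table 8.68 into posterior model probabilities. It assumes equal prior weights on the seven models, so each posterior probability is proportional to the model's prior predictive density; the common scaling by the $a_0 = 5$ value cancels in the normalization. The function name and variable names are illustrative.

```python
def posterior_model_probs(relative_predictive, prior_weights=None):
    """Posterior model probabilities: prior weight times predictive density, normalized."""
    k = len(relative_predictive)
    if prior_weights is None:
        prior_weights = [1.0 / k] * k          # equal mixture of the models
    joint = [w * r for w, r in zip(prior_weights, relative_predictive)]
    total = sum(joint)
    return [j / total for j in joint]

# Column of Table 8.68 for the data set whose last observation is 8.030.
degrees_of_freedom = ["1", "2", "5", "10", "20", "60", "inf"]
col_8030 = [1.501, 1.412, 1.000, 0.945, 0.623, 0.221, 0.096]

probs = posterior_model_probs(col_8030)
mixture_relative = sum(col_8030) / len(col_8030)   # the "average" row entry

if __name__ == "__main__":
    for dof, p in zip(degrees_of_freedom, probs):
        print(f"a0 = {dof:>3}: posterior probability {p:.3f}")
    print(f"equal-mixture relative predictive density: {mixture_relative:.3f}")
```

With the outlying last observation, the heavy-tailed models (small $a_0$) receive most of the posterior weight, which is the sense in which models with high prior predictive density surface in the mixture.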



8.7 Problems 

Section 8.1: 

1. Let $\mathcal{X}_n$ be the space of sequences of 0s and 1s of length $n + 1$ which start
with 0. Let $T_n(0, x_1, \ldots, x_n)$ be the four counts of transitions (from 0 to 1,
from 0 to 0, from 1 to 1, from 1 to 0) in $(0, x_1, \ldots, x_n)$. Call these counts
$(T_{n,0,1}, T_{n,0,0}, T_{n,1,1}, T_{n,1,0})$. Let $r_n(A, t)$ be uniform over all sequences with
the appropriate numbers of transitions (that is, $t_{0,1}$ transitions from 0 to
1, etc.).

(a) Show that the conditions of Theorem 2.111 hold. 

(b) Find the extreme points of the set M. 






(c) Write the representation of Theorem 2.111 as an integral over a finite- 
dimensional space. 

2. Suppose that $\{Y_{ij} : j = 1, \ldots, n_i;\ i = 1, \ldots, k\}$ are conditionally
independent with $Y_{ij} \sim N(\theta_i, 1)$ given $\Theta = (\theta_1, \ldots, \theta_k)$ and $M = \mu$. Suppose
that $\Theta_1, \ldots, \Theta_k$ are IID with $N(\mu, 1)$ distribution given $M = \mu$ and that
$M \sim N(\mu_0, 1)$.

(a) Find the marginal distribution of each $Y_{ij}$.

(b) Show that the $Y_{ij}$ are not exchangeable.

(c) Find the posterior distribution of $\Theta$ and $M$.

3. Suppose that $\{Y_{ij} : j = 1, \ldots, n_i;\ i = 1, \ldots, k\}$ are conditionally
independent with $Y_{ij} \sim N(\theta_i, 1)$ given $\Theta = (\theta_1, \ldots, \theta_k)$, $M = \mu$, and $T = \tau$.
Suppose that $\Theta_1, \ldots, \Theta_k$ are IID with $N(\mu, \tau^2)$ distribution given $M = \mu$
and $T = \tau$. Show that the improper prior with density $\tau^{-1}$ (with respect
to Lebesgue measure) leads to an improper posterior for $T$.

Section 8.2: 

4. Prove Proposition 8.13 on page 485. 

5. *Consider the data in the following table:

  $i$   $n_i$   $x_{ij}$, $j = 1, \ldots, n_i$
  1     3        9.549    10.274     7.142
  2     2       11.430    11.890
  3     3        6.898     4.329     6.905
  4     4       12.620    13.050    12.530    11.890


Model the Xij using the one-way ANOVA model of Section 8.2.1 with prior 
hyperparameters tpo = 15, Co = 0.25, &o = 1, do = 1, and A ~ Exp(0.2). 

(a) Find the product of the prior density of A and the "marginal likeli- 
hood" function of A. 

(b) Use numerical integration to find the normalizing constant for the 
posterior density. 

(c) Find the posterior predictive density of a future observation from 
the i = 4 group using a numerical integration method, importance 
sampling, or the method of Laplace. (Approximate the density at the 
points from 9 to 15 in steps of 0.05.) 

6. Prove that (8.20) on page 490 is true. 

7. Prove the following matrix theorem: If $A$ and $B$ are nonsingular matrices,
then

$$(A + B)^{-1} = A^{-1} - A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1},$$
$$A(A + B)^{-1}B = (A^{-1} + B^{-1})^{-1}.$$






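8. Prove the following matrix theorem: Let $\Sigma_* = \Sigma_1 + \Sigma_2$, where $\Sigma_1$ and $\Sigma_2$
are nonsingular symmetric matrices, and let $a$ and $b$ be vectors. Then

$$a^\top \Sigma_1 a + b^\top \Sigma_2 b - (\Sigma_1 a + \Sigma_2 b)^\top \Sigma_*^{-1} (\Sigma_1 a + \Sigma_2 b) = (a - b)^\top (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} (a - b). \qquad (8.70)$$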

(Hint: Use Problem 7 above.) 

Section 8.4:

9. Each scientific paper published by a particular author receives a random
number of citations from other authors in the years following its
publication. For $i = 1, \ldots, k$, and $j = 1, \ldots, n$, let $X_{ij}$ denote the number
of citations paper $j$ received $i$ years after publication. Let $r_1, \ldots, r_k$ be
known positive numbers. Model the $X_{ij}$ as conditionally independent given
$M_1, \ldots, M_n, \Theta$ with $X_{ij}$ having $Poi(M_j r_i)$ distribution. We model the $M_j$
as IID with $Exp(\theta)$ distribution conditional on $\Theta = \theta$.

(a) Find minimal sufficient statistics for the parameters $M_1, \ldots, M_n, \Theta$.

(b) Supposing that $\Theta$ is the only parameter, find a one-dimensional
sufficient statistic.

(c) Find the MLE $\hat{\Theta}$ of $\Theta$.

(d) Use the naive empirical Bayes approach (assuming $\Theta = \hat{\Theta}$) to find
the posterior distribution of the parameters $M_1, \ldots, M_n$.

10. Suppose that $X_i \sim Bin(n_i, \theta_i)$, $i = 1, \ldots, k$ are conditionally independent
given $\Theta = (\theta_1, \ldots, \theta_k)$. Suppose that we model the $\Theta_i$ as conditionally IID
with $Beta(\alpha, \beta)$ distribution given $(A, B) = (\alpha, \beta)$.

(a) Although the formulas cannot be written out completely, describe
how one would implement the naive empirical Bayes approach using
MLEs for $A$ and $B$.

(b) Use the following data, and compute the naive empirical Bayes
posterior distributions and posterior means for the parameters. We have
$k = 2$, $n_1 = 5$, $n_2 = 10$, $X_1 = 3$, and $X_2 = 3$.

Section 8.5: 

11. Suppose that $X \sim \Gamma^{-1}(a, b)$ and $Y \sim \Gamma^{-1}(c, d)$ are independent. Let $Z =
X/Y$. Prove that the conditional distribution of $X$ given $Z = z$ is $\Gamma^{-1}(a +
c, b + dz)$.

12. Using the notation of Problem 9 above, suppose that $\Theta$ has a prior
distribution, which is $\Gamma(a, b)$, with $a$ and $b$ known constants.

(a) Find the posterior density of $\Theta$ except for the normalizing constant.

(b) Set up a successive substitution sampling scheme to generate a sample
of $\Theta$ and $M_1, \ldots, M_n$ from the joint posterior distribution.

(c) Write a formula for an approximation to the posterior density of $\Theta$
and of $M_1, \ldots, M_n$ based on the successive substitution sample.






(d) Describe the similarities and differences between the above approxi- 
mations to the joint density for the Mj and the approximation found 
via the empirical Bayes approach. 

13. Let $v$ and $z$ be $k$-dimensional vectors with $v_i > 0$ for $i = 1, \ldots, k$. Suppose
that $X \sim N_k(\mathrm{diag}(v)z, \mathrm{diag}(v))$, where $\mathrm{diag}(v)$ is a diagonal matrix with
$(i, i)$ element equal to $v_i$. Prove that the conditional distribution of $X$ given
$1^\top X = c$ is

N k f[diag(v) - -4~m; T ]2,diag(t;) - -r-vv^ . 
\ 1 v 1 1 v J 

14. *Prove that condition (8.41) holds if $Y \sim N_k(\mu, \Sigma)$ and SSS is applied to

the coordinates in their natural order. (Hint: First, prove that the condi- 
tional distribution of the next iterate X given the current iterate X' is 
multivariate normal with constant covariance matrix and mean that is a 
linear function of X' . You can now integrate over x' analytically using facts 
from the theory of the multivariate normal distribution. The integral over 
x becomes the integral of the ratio of two normal densities. You can use 
problems 7 and 8 in this chapter to show that this integral is a constant 
times the integral of a normal density.) 

Section 8.6: 



15. The SSS algorithm allows one to approximate posterior distributions
without calculating the marginal density of the data $f_X(x)$. When fitting
mixture models, it is important to compute $f_{X|\Psi}(x|\psi)$, where $\Psi$ is the
parameter indexing models. Consider a single model with parameter $\Theta =
(\Theta_1, \ldots, \Theta_p)$. Each iteration of SSS starts with a simulated vector $\Theta^{(i)} =
(\Theta_1^{(i)}, \ldots, \Theta_p^{(i)})$ and then simulates $\Theta^{(i+1)}$ one coordinate at a time using
the conditional posterior distribution of each coordinate given the others:

$$f_{\Theta_j|\Theta_{\setminus j}, X}\left(\Theta_j^{(i+1)} \,\middle|\, \Theta_1^{(i+1)}, \ldots, \Theta_{j-1}^{(i+1)}, \Theta_{j+1}^{(i)}, \ldots, \Theta_p^{(i)}, x\right). \qquad (8.71)$$

Call the expression in (8.71) $v_j^{(i+1)}$. (Note that $v_1^{(i+1)}$ and $v_p^{(i+1)}$ have
slightly different formulas due to the effect of being at the ends of the
vector.) Prove that

$$E\left(\frac{f_{X|\Theta}(x|\Theta^{(i+1)})\, f_\Theta(\Theta^{(i+1)})}{\prod_{j=1}^p v_j^{(i+1)}}\right) = f_X(x),$$

where the $E(\cdot)$ refers to the joint distribution of $(\Theta^{(i)}, \Theta^{(i+1)})$ in the
simulation.

16. Suppose that one wishes to fit a mixture of $k$ models, but that one is
required to use SSS to fit each model. Let $\Psi \in \{1, \ldots, k\}$ index the different
models, and let $\Theta_\psi$ be the parameters of model $\psi$, for $\psi = 1, \ldots, k$. Explain
how one could use Problem 15 above to help estimate $f_{\Psi|X}(\psi|x)$ for all
values of $\psi$.



Chapter 9 
Sequential Analysis 



Most of the results of earlier chapters concern situations in which a partic- 
ular data set is to be observed and the decisions, if any, to be made concern 
the values of future observations. It sometimes happens that as we observe, 
we get to decide what, if any, data to collect next. In this chapter, we will 
describe some theory and methods for dealing with such situations.$^1$

9.1 Sequential Decision Problems 

As a simple example of a situation in which we need to decide whether or 
not to collect more data, consider the following. 

Example 9.1. We are considering purchasing a shipment of parts. Prior to
observing any data, we believe that the proportion $P$ of defective parts has a $U(0, 1)$
distribution. We believe that the individual parts ($X_i = 1$ if part $i$ is defective)
are conditionally independent $Ber(p)$ random variables given $P = p$. We decide
that we can sample at most 10 parts, and we will reject the shipment if the 
posterior mean of P is greater than 0.6. This will occur if there are 7 or more 
defectives out of a sample of 10. Suppose that the first seven parts are defective. 
Clearly, there is no need to sample any more parts. Similarly, if the first six parts 
were defective, it might seem highly unlikely that the shipment would be accept- 
able. Whether or not to continue sampling would depend on the relative costs of 
sampling and of rejecting a good shipment. 

The general sequential decision problem can be defined as follows. 



$^1$ The discussion in Section 9.1 is largely adapted from Chapter 12 of DeGroot
(1970).






Definition 9.2. Let $(S, \mathcal{A}, \mu)$ be a probability space, let $(\mathcal{V}, \mathcal{D})$ be a
measurable space, and let $V : S \to \mathcal{V}$ be a random quantity. For $i = 1, 2, \ldots$,
let $(\mathcal{X}_i, \mathcal{B}_i)$ be measurable spaces and let $(\mathcal{X}_0, \mathcal{B}_0) = (\{0\}, \{\emptyset, \{0\}\})$ be a
trivial space. For $i = 0, 1, \ldots$, let $X_i : S \to \mathcal{X}_i$ be random quantities. Let
$\mathcal{X} = \prod_{i=0}^\infty \mathcal{X}_i$ with product $\sigma$-field $\mathcal{B}^\infty$. Let $\mathcal{B}^n$ be the sub-$\sigma$-field generated
by the first $n + 1$ coordinates (including 0). That is, $B \in \mathcal{B}^n$ if and only if
$B = C \times \prod_{i=n+1}^\infty \mathcal{X}_i$ where $C \in \mathcal{B}_0 \otimes \cdots \otimes \mathcal{B}_n$. Let $X = (X_0, X_1, X_2, \ldots)$.
A stopping time is a nonnegative, extended integer-valued function $N$ (i.e.,
$N \in \mathcal{N}^* = \{0, 1, \ldots\} \cup \{\infty\}$) defined on $\mathcal{X}$ such that, for every finite
$n$, $\{x : N(x) = n\}$ is measurable with respect to $\mathcal{B}^n$. Let the action
space be $\aleph = \aleph' \times \mathcal{N}^*$, and let the loss be $L : \mathcal{V} \times \aleph \to \mathbb{R}$ such that
$L(v, (a, n)) = \sum_{i=0}^n c_i + L'(v, a)$, where $c_i \ge 0$ for all $i$ ($c_0 = 0$). Let $\alpha$
be a $\sigma$-field of subsets of $\aleph'$, and let $\mathcal{P}_\alpha$ be the collection of probability
measures on $(\aleph', \alpha)$. A randomized sequential decision rule is a pair
$\delta = (\delta^*, N)$ where $\delta^* : \mathcal{X} \to \mathcal{P}_\alpha$ and $N$ is a stopping time. The function $\delta^*$
is called the terminal decision rule. A nonrandomized sequential decision
rule is a randomized sequential decision rule such that, for each $x \in \mathcal{X}$,
there is $\delta'(x) \in \aleph'$ such that for each $A \in \alpha$, $\delta^*(x)(A) = 1$ if $\delta'(x) \in A$
and $\delta^*(x)(A) = 0$ if $\delta'(x) \notin A$. If $\delta$ is nonrandomized, then $\delta'$ is called the
terminal decision rule.

A convenient notation will be to let $X^n = (X_0, \ldots, X_n)$ for finite $n$
and $X^\infty = X$ if necessary. Also, $x^n = (x_0, x_1, \ldots, x_n)$ and $x^\infty = x$ for
$x \in \mathcal{X}$. Note that we have assumed that a stopping time might be infinite.
If there exists $x$ such that $N(x) = \infty$, then $\delta^*(x)$ must still be defined, even
though it is hardly a "terminal" decision. For convenience, we will often
write summations like $\sum_{n=1}^{\infty+}$ to indicate that one extra term for $n = \infty$
is to be included in the usual sum $\sum_{n=1}^{\infty}$. One way to prevent decision
rules from taking infinite samples with positive probability is to require
that $\sum_{i=1}^\infty c_i = \infty$ so that any rule that takes infinite samples with positive
probability must have infinite risk.

If we can restrict attention to decision rules $(\delta^*, N)$ such that $N \le n$,
then there is an intuitively simple method of finding the optimal sequential
decision rule. The idea is to decide what would be the optimal decision and
its risk after observing $X^n$, then compare this to what the risk would be
if we only observed $X^{n-1}$. Whether $N(x) = n$ or $N(x) = n - 1$ is decided
is based on which is smaller. We now know what the optimal procedure is
after observing $n - 1$ observations and we know its risk. Compare this to
what would be optimal if we stopped after $X^{n-2}$, and so on. This procedure
is called backward induction. Consider an illustration.

Example 9.3. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $Ber(\theta)$ given $\Theta = \theta$
and $\Theta$ has $U(0, 1)$ distribution. Suppose that we can take at most four
observations. The action space has $\aleph' = \{0, 1\}$, and the loss function is $L(\theta, (a, n)) =
0.01n + L'(\theta, a)$ with

$$L'(\theta, a) = \begin{cases} 1 & \text{if } \theta > 0.4 \text{ and } a = 0, \\ 1 & \text{if } \theta < 0.4 \text{ and } a = 1, \\ 0 & \text{otherwise.} \end{cases}$$

It follows that the optimal action, after $N$ is determined, is to choose $a = 1$ if
the posterior probability of $\Theta < 0.4$ is less than 0.5. If we observe $X^4 = x^4$, there
are five possible posteriors depending on the value of $y_4 = \sum_{i=1}^4 x_i$. The risks
are the probabilities of wrong decision plus 0.04 for the four observations.

  $y_4$   Posterior    Pr($\Theta$ < 0.4 | X = x)   a   Risk
  0       Beta(1, 5)   0.9222                       0   0.1178
  1       Beta(2, 4)   0.6630                       0   0.3770
  2       Beta(3, 3)   0.3174                       1   0.3574
  3       Beta(4, 2)   0.0870                       1   0.1270
  4       Beta(5, 1)   0.0102                       1   0.0502


Next, suppose that we only observe $X^3 = x^3$. Let $y_3 = x_1 + x_2 + x_3$. The
posterior will be $Beta(y_3 + 1, 4 - y_3)$, and the predictive distribution for $X_4$ is
$Ber([y_3 + 1]/5)$. The risk for stopping is just 0.03 (for the three observations)
plus the probability of wrong decision based on three observations. The risk for
continuing is the weighted average of the two possible risks that could occur
depending on the value of $X_4$. For example, if $y_3 = 2$, the predictive distribution
for $X_4$ is $Ber(0.6)$, and the risk for continuing is $0.6 \times 0.1270 + 0.4 \times 0.3574 =
0.2192$. For the other values we calculate

  $y_3$                        0            1            2            3
  Posterior                    Beta(1, 4)   Beta(2, 3)   Beta(3, 2)   Beta(4, 1)
  Pr($\Theta$ < 0.4 | X = x)   0.8704       0.5248       0.1792       0.0256
  a                            0            0            1            1
  Risk(stop)                   0.1596       0.5052       0.2092       0.0556
  Pr($X_4$ = 1)                0.2          0.4          0.6          0.8
  Risk(continue)               0.1696       0.3692       0.2192       0.0656
  Stop                         yes          no           yes          yes
  Risk                         0.1596       0.3692       0.2092       0.0556



So, only if $y_3 = 1$ would we continue to observe $X_4$. Next, suppose that we only
observe $X^2 = x^2$, and let $y_2 = x_1 + x_2$.

  $y_2$                        0            1            2
  Posterior                    Beta(1, 3)   Beta(2, 2)   Beta(3, 1)
  Pr($\Theta$ < 0.4 | X = x)   0.7840       0.3520       0.0640
  a                            0            1            1
  Risk(stop)                   0.2360       0.3720       0.0840
  Pr($X_3$ = 1)                0.25         0.5          0.75
  Risk(continue)               0.2120       0.2892       0.0940
  Stop                         no           no           yes
  Risk                         0.2120       0.2892       0.0840



We would continue if $y_2 \in \{0, 1\}$. Next, suppose that we only observe $X_1 = x_1$.

  $x_1$                        0            1
  Posterior                    Beta(1, 2)   Beta(2, 1)
  Pr($\Theta$ < 0.4 | X = x)   0.6400       0.1600
  a                            0            1
  Risk(stop)                   0.3700       0.1700
  Pr($X_2$ = 1)                1/3          2/3
  Risk(continue)               0.2377       0.1524
  Stop                         no           no
  Risk                         0.2377       0.1524



If we take one observation, we will take two. Finally, before we take any
observations, $\Pr(\Theta < 0.4) = 0.4$, so the terminal decision would be $a = 1$ and the risk
would be 0.4. On the other hand, $\Pr(X_1 = 1) = 0.5$, so the risk of continuing is
$0.5 \times 0.1524 + 0.5 \times 0.2377 = 0.1951$. Hence, we should take the first observation.
To summarize, the optimal procedure is

  Data   (1,1,·,·)   (0,0,0,·)   (1,0,1,·)   (0,1,1,·)   (1,0,0,0)
  N      2           3           3           3           4
  a      1           0           1           1           0

  Data   (0,1,0,0)   (0,0,1,0)   (1,0,0,1)   (0,1,0,1)   (0,0,1,1)
  N      4           4           4           4           4
  a      0           0           1           1           1

where the dots stand for observations that do not need to be taken.

To compare with other procedures, there is the fixed sample size procedure
with $n = 4$, which has risk 0.2059. This risk is the average of the five possible
risks after four observations because each of the five possibilities has probability
1/5. This procedure rejects $H$ if $y_4 \in \{2, 3, 4\}$. The optimal procedure which
takes at most three observations has risk 0.2232 (see Problem 1 on page 567).
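
The backward induction in Example 9.3 is mechanical enough to automate. The sketch below is one possible implementation, not the author's code: the function names are illustrative, and it relies on the identity that for positive integers $a$ and $b$ the $Beta(a, b)$ distribution function at $x$ equals the probability that a $Bin(a + b - 1, x)$ variable is at least $a$. The state after $n$ observations is the number of successes $y$, whose posterior is $Beta(y + 1, n - y + 1)$.

```python
from functools import lru_cache
from math import comb

def beta_cdf_int(x, a, b):
    """P(Beta(a, b) <= x) for positive integers a, b, via a binomial tail sum."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))

def backward_induction(max_n, cost=0.01, cut=0.4):
    """Bayes risk of the optimal rule taking at most max_n Bernoulli observations."""
    @lru_cache(maxsize=None)
    def risk(n, y):
        # Risk of stopping now: sampling cost so far plus probability of a wrong decision.
        p_low = beta_cdf_int(cut, y + 1, n - y + 1)
        stop = cost * n + min(p_low, 1.0 - p_low)
        if n == max_n:
            return stop
        # Risk of continuing: predictive average of the optimal risks one step ahead.
        p_one = (y + 1) / (n + 2)
        cont = p_one * risk(n + 1, y + 1) + (1.0 - p_one) * risk(n + 1, y)
        return min(stop, cont)
    return risk(0, 0)

if __name__ == "__main__":
    for n in (3, 4):
        print(f"at most {n} observations: risk = {backward_induction(n):.4f}")
```

Run with at most four observations, the recursion reproduces the risk 0.1951 obtained above, and with at most three it gives 0.2232. Larger values of `max_n` can be tried the same way, since the memoized recursion visits each pair $(n, y)$ only once.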

After reviewing Example 9.3, it is clear that if $\delta$ is a sequential decision
rule such that $\Pr(N = 0) > 0$, then $\Pr(N = 0) = 1$, since we have not
allowed any randomization in the decision of whether or not to take
observations. The decision as to whether to take any observations is based
on the prior distribution and the various costs. No randomness is involved;
hence $\{N = 0\}$ is either $\emptyset$ or all of $S$.

For a general problem, let $Q$ be a probability on $\mathcal{V}$ (usually the parameter
space $\Omega$).$^2$ Define

$$\rho_0(Q) = \min_a \int L'(v, a)\, dQ(v),$$

the minimum risk possible without taking any observations if the prior is
$Q$. If $Q$ denotes a prior distribution, then for each $n$, let $Q_n(\cdot|x)$ denote
the conditional distribution obtained from $Q$ by conditioning on $X_0 =
x_0, \ldots, X_n = x_n$. (For $n = \infty$, $Q_\infty(\cdot|x)$ denotes conditional probability
given $X = x$.) In particular, $Q_0(\cdot|x) = Q$. If $N$ is a stopping time, then



$^2$ It may be that $Q$ is already the conditional distribution obtained from some
other probability $P$ after conditioning on some observations.



$Q_N(\cdot|x)$ will denote $\sum_{n=0}^{\infty+} Q_n(\cdot|x) I_{\{n\}}(N(x))$. (See Problem 3 on page 567
for an alternative understanding of $Q_n$ and $Q_N$.) Suppose that we observe
$X^n$ and make the best possible decision. Then $\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i$ is the
risk including the cost of observations.

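Definition 9.4. Let $Q$ be a prior distribution on $\mathcal{V}$. Suppose that $\delta =
(\delta^*, N)$ is a sequential decision rule such that for every $n$ (finite or infinite)
and every $x \in \{x : N(x) = n\}$,

$$\int_{\mathcal{V}} \int_{\aleph'} L'(v, a)\, d\delta^*(x)(a)\, dQ_n(v|x) = \rho_0(Q_n(\cdot|x)).$$

Then $\delta$ is said to decide optimally after stopping.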

Another way to describe what it means for $\delta = (\delta^*, N)$ to decide
optimally after stopping is to say that if $N(x) = n$, then $\delta^*(x)$ is the same as
the formal Bayes rule for a sample of size $n$. A decision rule that decides
optimally after stopping may not have an optimal stopping time, but once
the decision to stop is made, the optimal terminal decision is made. Clearly,
the formal Bayes rule in a sequential decision problem will decide optimally
after stopping (see Problem 2 on page 567). For a decision rule that decides
optimally after stopping, the Bayes risk is

$$\rho(Q, \delta) = \sum_{n=0}^{\infty+} \int_{\{x : N(x) = n\}} \left[\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i\right] dF_{X^n}(x^n).$$


Definition 9.5. Suppose that $\delta$ decides optimally after stopping and $Q$ is
a prior on $\mathcal{V}$. We say that $\delta$ is regular if $\rho(Q, \delta) \le \rho_0(Q)$ and if, for every
finite $n > 0$ and every $x \in \{x : N(x) > n\}$,

$$\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i > E\left[\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i \,\middle|\, X^n = x^n\right]. \qquad (9.6)$$

In words, a decision rule is regular if, whenever the stopping time has not
yet occurred, the risk of stopping is larger than the risk of continuing. The
rule in Example 9.3 on page 537 is regular, as is every backward induction
rule. (See Problem 4 on page 568.)

Theorem 9.7. If $\delta$ decides optimally after stopping, then there is a regular
$\delta_1$ such that $\rho(Q, \delta_1) \le \rho(Q, \delta)$.

PROOF. Define $\delta_1$ as follows. The terminal decision rule for $\delta_1$ is to decide
optimally after stopping (just like $\delta$). The stopping time $N_1$ for $\delta_1$ is the
smaller of $N$ (the stopping time for $\delta$) and the first time at which (9.6)
fails. Clearly, this is finite and is a stopping time since both sides of (9.6)
are $\mathcal{B}^n$ measurable. If $\delta$ is regular, then (9.6) never fails and $\delta_1 = \delta$. Next,
note that both sides of (9.6) are equal for each $x$ such that $N(x) = n$. So,
we can compute $\rho(Q, \delta_1)$ as

$$\sum_{n=0}^{\infty+} \int_{\{x : N_1(x) = n\}} \left[\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i\right] dF_{X^n}(x^n)$$
$$\le \sum_{n=0}^{\infty+} \int_{\{x : N_1(x) = n\}} E\left[\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i \,\middle|\, X^n = x^n\right] dF_{X^n}(x^n)$$
$$= \sum_{n=0}^{\infty+} E[\rho(Q, \delta) \mid N_1 = n] \Pr(N_1 = n) = \rho(Q, \delta). \qquad \Box$$


Regular decision rules do not sample too many observations, but they 
may not sample enough. That is, whenever a regular decision rule contin- 
ues sampling, the risk for continuing is smaller than the risk for stopping. 
However, when a regular decision rule stops, the risk for continuing may 
still be smaller than the risk for stopping. For example, the optimal rule 
from a class of rules whose stopping times are all bounded by the same n 
is regular. 

Proposition 9.8. The optimal rule from the class of sequential decision 
rules that sample no more than n observations is regular. 

Definition 9.9. If $\delta_i = (\delta_i^*, N_i)$ is a regular decision rule for $i = 1, \ldots, k$,
the maximum of $\delta_1, \ldots, \delta_k$, denoted $\max\{\delta_1, \ldots, \delta_k\}$, is the decision rule
with stopping time $N = \max\{N_1, \ldots, N_k\}$ and terminal decision rule to
decide optimally after stopping.

Theorem 9.10. Let $Q$ be a prior on $\mathcal{V}$. If $\delta_1, \ldots, \delta_k$ are regular with finite
risk then $\delta_0 = \max\{\delta_1, \ldots, \delta_k\}$ is regular and $\rho(Q, \delta_0) \le \rho(Q, \delta_i)$, for
$i = 1, \ldots, k$.

Proof. We need only prove this for $k = 2$ because the general case follows
easily by induction. It is clear that

$$\mathcal{X} = \{x : N_1(x) = N_0(x)\} \cup \{x : N_1(x) < N_2(x)\}.$$

First, suppose that $N_1(x) < N_2(x)$. Then $N_0(x) = N_2(x)$. Let $n = N_1(x)$:

$$E\left[\rho_0(Q_{N_0}(\cdot|X)) + \sum_{i=0}^{N_0} c_i \,\middle|\, X^n = x^n\right] = E\left[\rho_0(Q_{N_2}(\cdot|X)) + \sum_{i=0}^{N_2} c_i \,\middle|\, X^n = x^n\right]$$
$$< \rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i = E\left[\rho_0(Q_{N_1}(\cdot|X)) + \sum_{i=0}^{N_1} c_i \,\middle|\, X^n = x^n\right]. \qquad (9.11)$$

The first equality is true because $\delta_0$ and $\delta_2$ agree for all $x$ such that $N_2(x) =
N_0(x)$. The inequality follows since $\delta_2$ is regular and $N_2(x) > n$. The last
equality follows since $N_1(x) = n$. Next, suppose that $N_1(x) = N_0(x)$ and
$n = N_2(x) \le N_0(x)$:

$$E\left[\rho_0(Q_{N_0}(\cdot|X)) + \sum_{i=0}^{N_0} c_i \,\middle|\, X^n = x^n\right] = E\left[\rho_0(Q_{N_1}(\cdot|X)) + \sum_{i=0}^{N_1} c_i \,\middle|\, X^n = x^n\right]$$
$$\le \rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i = E\left[\rho_0(Q_{N_2}(\cdot|X)) + \sum_{i=0}^{N_2} c_i \,\middle|\, X^n = x^n\right]. \qquad (9.12)$$

The reasons for each line are the same as before except that the inequality
is only strict if $N_1(x) > n$. (Note that (9.12) holds even if $n = \infty$.) Together
(9.11) and (9.12) show that $\delta_0$ satisfies (9.6). In both of (9.11) and (9.12),
$n = \min\{N_1(x), N_2(x)\}$. Write

$$C_n = \{x : \min\{N_1(x), N_2(x)\} = n\}, \qquad \mathcal{X} = \bigcup_{n=0}^{\infty+} C_n,$$
$$\rho(Q, \delta_j) = \sum_{n=0}^{\infty+} \int_{C_n} E\left[\rho_0(Q_{N_j}(\cdot|X)) + \sum_{i=0}^{N_j} c_i \,\middle|\, X^n = x^n\right] dF_{X^n}(x^n) \qquad (9.13)$$

for $j = 0, 1, 2$. Together (9.11) and (9.12) say that the integrand in the
second line of (9.13) for $j = 0$ is no greater than for either $j = 1$ or $j = 2$.
The inequalities in the conclusion to the theorem follow. $\Box$
If $r = \inf_\delta \rho(Q, \delta)$, then there is a sequence $\{\delta_i\}_{i=1}^\infty$ such that $r =
\lim_{i \to \infty} \rho(Q, \delta_i)$. Finding such a sequence is not as difficult as it may seem.

Definition 9.14. Let $\delta = (\delta^*, N)$ be a regular sequential decision rule. Let
$N'$ be a stopping time. The truncation of $\delta$ at $N'$ is the decision rule with
stopping time $\min\{N, N'\}$ and terminal decision optimal after stopping.

Lemma 9.15.$^3$ Let $\delta_0$ be the optimal rule in a sequential decision problem,
and suppose that $\delta_0$ has finite risk. For each $n = 1, 2, \ldots$, let $\delta_n$ be the
truncation of $\delta_0$ to at most $n$ observations. Define

$$p_n = \int_{\{x : N_0(x) > n\}} \rho_0(Q_n(\cdot|x))\, dF_{X^n}(x^n).$$

If $\lim_{n \to \infty} p_n = 0$, then $\lim_{n \to \infty} \rho(Q, \delta_n) = \rho(Q, \delta_0)$.

$^3$ This lemma is used to help prove Corollary 9.17.

Proof. If $N_0 = 0$, the result is trivial, so suppose that $N_0 \ge 1$. For a general
decision rule $\delta = (\delta^*, N)$, define

$$T_n(\delta) = \int_{\{x : N(x) = n\}} \rho_0(Q_n(\cdot|x))\, dF_{X^n}(x^n) + \Pr(N = n) \sum_{i=0}^n c_i.$$

We know that for $k = 1, \ldots, n - 1$,

$$\{x : N_0(x) = k\} = \{x : N_n(x) = k\}$$

and

$$\{x : N_n(x) = n\} = \{x : N_0(x) = n\} \cup \{x : N_0(x) > n\}.$$

So $T_k(\delta_n) = T_k(\delta_0)$ for $k = 1, \ldots, n - 1$ and

$$T_n(\delta_n) = T_n(\delta_0) + p_n + \Pr(N_0 > n) \sum_{i=1}^n c_i.$$

So, we can write

$$\rho(Q, \delta_0) = \sum_{n=1}^{\infty+} T_n(\delta_0) = \lim_{n \to \infty} \sum_{i=1}^n T_i(\delta_0),$$
$$\rho(Q, \delta_n) = \sum_{i=1}^{n-1} T_i(\delta_n) + T_n(\delta_n) = \sum_{i=1}^n T_i(\delta_0) + p_n + \Pr(N_0 > n) \sum_{i=1}^n c_i.$$

Since $\lim_{n \to \infty} \Pr(N_0 > n) = 0$ and $\lim_{n \to \infty} p_n = 0$, the result follows. $\Box$

Lemma 9.16.$^4$ Suppose that $L' \ge 0$ and $\lim_{n \to \infty} E\rho_0(Q_n(\cdot|X)) = 0$. Then
$\lim_{n \to \infty} p_n = 0$, where $p_n$ is defined in Lemma 9.15.

Proof. Since $p_n$ is the integral of $\rho_0(Q_n(\cdot|x))$ over a subset of $\mathcal{X}_0 \times \cdots \times \mathcal{X}_n$
and the integrand is nonnegative, $p_n$ is no greater than $E\rho_0(Q_n(\cdot|X))$. $\Box$

These last two results combine into a corollary that provides a sequence
of decision rules with risk converging to the optimal risk.

Corollary 9.17. Suppose that $L' \ge 0$ and $\lim_{n \to \infty} E\rho_0(Q_n(\cdot|X)) = 0$. Let
$\delta_{n,0}$ be the optimal rule among those that take at most $n$ observations. Then
$\lim_{n \to \infty} \rho(Q, \delta_{n,0}) = \rho(Q, \delta_0)$.

$^4$ This lemma is used to help prove Corollary 9.17.






Example 9.18 (Continuation of Example 9.3; see page 537). If $X^n = x^n$ is
observed, let $y_n = \sum_{i=1}^n x_i$. Then $\rho_0(Q_n(\cdot|x))$ is the smaller of the two probabilities
that a $Beta(y_n + 1, n - y_n + 1)$ random variable is at most 0.4 or is at least 0.4.
If $y_n/n$ converges to anything other than 0.4, one of the two probabilities will go
to 0 and the other to 1. Since $y_n/n$ will converge to something other than 0.4
with probability 1 and $\rho_0$ is bounded, the dominated convergence theorem A.57
says that $\lim_{n \to \infty} E\rho_0(Q(\cdot|X_0, \ldots, X_n)) = 0$. Hence, we could find rules with
approximately optimal risk by taking a sequence of optimal rules among the classes
of those that take at most $n$ observations for $n = 1, 2, \ldots$. The method used for
$n = 4$ is easily generalized to arbitrary $n$. Here are the computed risks for the
optimal rules $\delta_n$ for several values of $n$:

  n      5        10       20       50       100      200
  Risk   0.1921   0.1720   0.1643   0.1631   0.1631   0.1631

After $n = 75$, the risk did not change in the first eight significant digits. After
$n = 125$, sixteen significant digits remained constant.

Example 9.19. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $N(\mu, \sigma^2)$ given
$(M, \Sigma) = (\mu, \sigma)$ and $M \sim N(\mu_0, \sigma^2/\lambda_0)$ given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$,
with $a_0 > 2$. Let the terminal action space be $\aleph' = \mathbb{R}$, and let the loss be $L((\mu, \sigma), (a, n)) = cn + (\mu - a)^2$.
The posterior distribution of $M$ given $X^n = x^n$ is $t_{a_n}(\mu_n, b_n/[\lambda_n a_n])$, where

$$\lambda_n = \lambda_0 + n, \qquad \mu_n = \frac{\lambda_0\mu_0 + n\overline{x}_n}{\lambda_n}, \qquad a_n = a_0 + n,$$

$$b_n = b_0 + \sum_{i=1}^n (x_i - \overline{x}_n)^2 + \frac{n\lambda_0}{\lambda_n}(\overline{x}_n - \mu_0)^2.$$

Hence, the optimal decision after stopping at $N = n$ is $a = \mu_n$ and

$$\rho_0(Q_n(\cdot|x)) = \frac{b_n}{(a_n - 2)\lambda_n}.$$

The prior mean of this is $b_0/[(a_0 - 2)(\lambda_0 + n)]$, which goes to 0. It follows that
the risk of the optimal procedure that takes at most $n$ observations converges
to the optimal risk. If $a_0 \le 2$, then $\rho_0(Q) = \infty$, and it pays to take one or two
observations until $a_n > 2$. At this point, pretend that the problem starts over
and use the above reasoning.

If we modify the problem to have loss $L((\mu, \sigma), (a, n)) = cn + (\mu - a)^2/\sigma^2$, then
$\rho_0(Q_n(\cdot|x)) = 1/\lambda_n$, which depends on the data only through $n$. Hence, it is easy
to see that the optimal rule has $N = n$ with probability 1, where $n$ provides a
minimum to $cn + 1/(\lambda_0 + n)$.
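
The fixed sample size in this modified problem can be found by direct search. The following is a minimal sketch, not part of the original text; the function name `best_fixed_sample_size`, the search cap `n_max`, and the values $c = 0.1$, $\lambda_0 = 1$ are illustrative assumptions.

```python
def best_fixed_sample_size(c, lam0, n_max=10_000):
    """Return (n, risk) minimizing c*n + 1/(lam0 + n) over n = 0, 1, ..., n_max.

    c     : cost per observation (assumed constant)
    lam0  : prior hyperparameter lambda_0
    n_max : hypothetical cap on the search range
    """
    def total_risk(n):
        # sampling cost plus terminal risk 1/lambda_n with lambda_n = lam0 + n
        return c * n + 1.0 / (lam0 + n)

    best_n = min(range(n_max + 1), key=total_risk)
    return best_n, total_risk(best_n)


n_star, risk = best_fixed_sample_size(c=0.1, lam0=1.0)
print(n_star, risk)
```

Because $cn + 1/(\lambda_0 + n)$ is convex in $n$, the search could stop at the first increase; the exhaustive scan is kept only for simplicity.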

Proposition 9.20. If there exists finite $n$ such that $\rho_0(Q_n(\cdot|x)) < c_{n+1}$ for
all $x$, then the optimal procedure takes no more than $n$ observations. The
optimal procedure is a fixed sample size procedure if $\rho_0(Q_n(\cdot|x))$ depends
on the data only through $n$.

In general, it is quite difficult to specify the optimal sequential decision 
procedure. The first part of Example 9.19 is one such case. To find or 
approximate the optimal rule in general, we will suppose that the cost 
of each observation is the same and that the available observations are 



9.1. Sequential Decision Problems 545 



exchangeable. That is, assume that $c_n = c$ for all $n$ and $\{X_n\}_{n=1}^\infty$ are
exchangeable. If we let $\rho^*(Q) = \inf_\delta \rho(Q, \delta)$ denote the risk of the optimal
rule, then it is not difficult to see that

$$\rho^*(Q) = \min\{\rho_0(Q),\ E(\rho^*(Q_1(\cdot|X))) + c\}, \qquad (9.21)$$

since the second term is just the mean of the optimal risk of continuing
after the first observation given the first observation. If this is smaller than
the optimal risk for no data, then it is the optimal risk. Otherwise, the
optimal decision is to take no data and $\rho_0(Q)$ is the optimal risk. Clearly,
the optimal sequential decision rule is to stop sampling at $N(x) = n$, where
$n$ is the first time that $\rho_0(Q_n(\cdot|X)) = \rho^*(Q_n(\cdot|X))$. This prescription is only
useful if we know $\rho^*$. As we will demonstrate in the next theorem, we can
approximate $\rho^*$ by using successive substitution (see Section 8.5).

Theorem 9.22. Let $Q$ be a probability measure and suppose that $L' \ge 0$
and $\lim_{n\to\infty} E\rho_0(Q_n(\cdot|X)) = 0$. Define

$$\rho_{n+1}(Q) = \min\{\rho_0(Q),\ E(\rho_n(Q_1(\cdot|X))) + c\},$$

for $n = 0, 1, \ldots$. Then $\lim_{n\to\infty} \rho_n(Q) = \rho^*(Q)$ and $\rho_n(Q)$ is the risk of the
optimal rule among those that take at most $n$ observations.
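
The recursion in Theorem 9.22 can be sketched for a Bernoulli problem with a $Beta(a, b)$ distribution for the success probability and terminal risk equal to the smaller of the two posterior probabilities on either side of 0.4, as in Example 9.18. This is only an illustration: the cost `c = 0.01` is an assumed value (the cost used in Example 9.3 is not restated here), the function names are hypothetical, and integer Beta parameters are assumed so that the Beta CDF can be written as a binomial sum.

```python
from functools import lru_cache
from math import comb

THRESHOLD = 0.4  # boundary between hypothesis and alternative, as in Example 9.18


def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for positive integers a, b (binomial-sum identity)."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


def rho0(a, b):
    """Terminal risk: smaller of the posterior probabilities below/above 0.4."""
    p = beta_cdf(THRESHOLD, a, b)
    return min(p, 1.0 - p)


@lru_cache(maxsize=None)
def rho(n, a, b, c):
    """rho_n for a Beta(a, b) distribution: optimal risk with at most n more observations."""
    if n == 0:
        return rho0(a, b)
    p_success = a / (a + b)  # predictive probability that the next observation is 1
    continue_risk = c + (p_success * rho(n - 1, a + 1, b, c)
                         + (1 - p_success) * rho(n - 1, a, b + 1, c))
    return min(rho0(a, b), continue_risk)


# rho_n under a uniform prior for increasing n (assumed cost c = 0.01)
values = [rho(n, 1, 1, 0.01) for n in range(6)]
print(values)
```

The sequence `values` is nonincreasing in $n$, as the theorem implies; stopping is optimal at a state whenever `rho0` equals `rho` there.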

Proof. Clearly, in light of Corollary 9.17, we need only prove that $\rho_n$ is
the optimal risk for rules that take at most $n$ observations. We will use
induction. We know that $\rho_0$ is the optimal risk among rules that take no
observations. Suppose that $\rho_k$ is the optimal risk among rules that take at
most $k$ observations for some $k \ge 0$. Then

$$E(\rho_k(Q_1(\cdot|X))) + c \qquad (9.23)$$

is the risk for taking at least one observation and then using the optimal
rule that takes at most $k$ more observations. The optimal rule that takes
at most $k + 1$ observations must either take at least one observation or take
no observations. Hence, the risk of the optimal rule that takes at most $k + 1$
observations is the smaller of (9.23) and $\rho_0(Q)$. That is,

$$\min\{\rho_0(Q),\ E(\rho_k(Q_1(\cdot|X))) + c\} = \rho_{k+1}(Q). \qquad □$$

Theorem 9.22 can be applied to $Q_k(\cdot|X)$ to produce the following corollary.

Corollary 9.24. Let $Q$ be a probability measure, and suppose that $L' \ge 0$
and $\lim_{n\to\infty} E\rho_0(Q_n(\cdot|X)) = 0$. For each $n$ and $k$, the conditional mean of
the risk of the optimal rule among those that take at most $n + k$ observations
given the first $k$ observations and given that the optimal rule takes at least
$k$ observations is $\rho_n(Q_k(\cdot|X)) + ck$.

Corollary 9.24 can be used to define an alternative decision rule. 






Definition 9.25. The decision rule that continues to sample until the first
$n$ such that $\rho_0(Q_n(\cdot|x)) = \rho_k(Q_n(\cdot|x))$ is called the $k$-step look-ahead rule.

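Example 9.26 (Continuation of Example 9.19; see page 544). It is incredibly
difficult to calculate $\rho_n$ for $n > 2$. We illustrate here how to calculate $\rho_n$ for $n = 1, 2$.
The posterior distribution is determined by four hyperparameters $(a, b, \mu, \lambda)$
and

$$\rho_0(a, b, \mu, \lambda) = \frac{b}{\lambda(a-2)}.$$

After observing $X_1 = x$, let the posterior hyperparameters be

$$(a, b, \mu, \lambda)(x) = \left(a + 1,\ b + \frac{\lambda}{\lambda+1}(x - \mu)^2,\ \frac{\lambda\mu + x}{\lambda+1},\ \lambda + 1\right) = (a + 1, b(x), \mu(x), \lambda + 1).$$

We can write $b(X_1) = (1 + Y^2)b$, where

$$Y = \sqrt{\frac{\lambda}{\lambda+1}}\,\frac{X_1 - \mu}{\sqrt{b}} \sim t_a\!\left(0, \frac{1}{a}\right). \qquad (9.27)$$

In particular, $E(Y^2) = 1/(a - 2)$. It follows that

$$E(\rho_0((a, b, \mu, \lambda)(X_1))) = b\,E\!\left[\frac{1 + Y^2}{(\lambda+1)(a-1)}\right] = \frac{b}{(\lambda+1)(a-2)}.$$

So,

$$\rho_1(a, b, \mu, \lambda) = \min\left\{\frac{b}{\lambda(a-2)},\ c + \frac{b}{(\lambda+1)(a-2)}\right\} = \frac{b}{(\lambda+1)(a-2)} + \min\left\{\frac{b}{\lambda(\lambda+1)(a-2)},\ c\right\},$$

$$\rho_1((a, b, \mu, \lambda)(x)) = \frac{b(x)}{(\lambda+2)(a-1)} + \begin{cases} c & \text{if } c \le \frac{b(x)}{(\lambda+1)(\lambda+2)(a-1)}, \\ \frac{b(x)}{(\lambda+1)(\lambda+2)(a-1)} & \text{if not,} \end{cases}$$

$$= b\,\frac{1 + y^2}{(\lambda+2)(a-1)} + \begin{cases} c & \text{if } |y| \ge r, \\ b\,\frac{1 + y^2}{(\lambda+1)(\lambda+2)(a-1)} & \text{if not,} \end{cases}$$

where $y$ is as in (9.27) once again, and

$$r = \begin{cases} \sqrt{\frac{c(\lambda+1)(\lambda+2)(a-1)}{b} - 1} & \text{if } c(\lambda+1)(\lambda+2)(a-1) > b, \\ 0 & \text{if not.} \end{cases}$$

It follows that $E(\rho_1((a, b, \mu, \lambda)(X_1)))$ equals

$$\frac{b}{(\lambda+2)(a-2)} + c(1 - p) + \frac{b}{(\lambda+1)(\lambda+2)(a-1)}\,E\!\left[(1 + Y^2) I_{(-r,r)}(Y)\right]$$

$$= \frac{b}{(\lambda+2)(a-2)} + c(1 - p) + \frac{b}{(\lambda+1)(\lambda+2)(a-2)}\,q,$$

where $p = \Pr(|Y| < r)$ and $q = \Pr(|Z| < r)$, where $Z \sim t_{a-2}(0, 1/[a - 2])$.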

We could now calculate $\rho_2(a, b, \mu, \lambda)$ after each observation. If $\rho_0(a, b, \mu, \lambda)$ is
greater than $\rho_2$, we should continue to sample. If $\rho_0(a, b, \mu, \lambda)$ equals $\rho_2$, the
two-step look-ahead rule would stop. We could, however, try to achieve a better
approximation to $\rho^*$. One way to do this might be to numerically integrate
$\rho_2((a, b, \mu, \lambda)(x))$ times the predictive density of the next observation in order to
approximate $\rho_3(a, b, \mu, \lambda)$.

Consider the results in Table 9.30. We used a prior with $a_0 = 3$, $b_0 = 8$, $\mu_0 = 0$,
and $\lambda_0 = 1$. The cost per observation was $c = 0.1$. After the fourth observation, we
do not know whether or not $\rho_0 = \rho^*$. If we numerically integrate $\rho_2$, we get $\rho_3 = \rho_2$.
This means that we would have to consider at least four more observations
before there was any chance that the optimal rule would continue sampling. But
four more observations would cost 0.4 more without taking into account the
loss from squared error. Since the mean of $\rho_0$ with four more observations is
just 5/9 times the current $\rho_0$, which equals 0.337076, it seems unlikely that four
more observations would bring the risk down enough to justify continuing. In
fact, the lowest possible posterior risk we could obtain from sampling four more
observations would occur if all four of them were equal to the current posterior
mean, and then the risk would be 0.587264, which is barely less than $\rho_0$.
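
The hyperparameter update and the closed forms for $\rho_0$ and $\rho_1$ in this example are easy to evaluate numerically. The sketch below is illustrative only (the function names are hypothetical); it uses the prior $a_0 = 3$, $b_0 = 8$, $\mu_0 = 0$, $\lambda_0 = 1$, the cost $c = 0.1$, and the four observations listed in Table 9.30, and recomputes the terminal risk $\rho_0$ after each observation. Computing $\rho_2$ in general also requires $t$ distribution probabilities and is not attempted here.

```python
def update(a, b, mu, lam, x):
    """Posterior hyperparameters (a, b, mu, lambda)(x) after one more observation x."""
    return (a + 1,
            b + lam / (lam + 1) * (x - mu) ** 2,
            (lam * mu + x) / (lam + 1),
            lam + 1)


def rho0(a, b, mu, lam):
    """Terminal risk b / [lambda (a - 2)] under squared-error loss (requires a > 2)."""
    return b / (lam * (a - 2))


def rho1(a, b, mu, lam, c):
    """Risk of the best rule taking at most one more observation at cost c."""
    return min(rho0(a, b, mu, lam), c + b / ((lam + 1) * (a - 2)))


state = (3, 8.0, 0.0, 1.0)                        # (a_0, b_0, mu_0, lambda_0)
xs = [-0.129354, -2.158607, 1.558454, -0.677818]  # observations from Table 9.30

risks = [rho0(*state)]
for x in xs:
    state = update(*state, x)
    risks.append(rho0(*state))

print([round(r, 6) for r in risks])
```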

Another way to approximate $\rho^*$ is from below. It is possible (see Problem 6
on page 568) to show that if $0 \le \gamma_0 \le \rho^*$ (for example, $\gamma_0 = 0$)
and

$$\gamma_n(Q) = \min\{\rho_0(Q),\ E(\gamma_{n-1}(Q_1(\cdot|X))) + c\} \qquad (9.28)$$

for $n = 1, 2, \ldots$, then $\gamma_n \le \rho^*$ for all $n$ and $\lim_{n\to\infty} \gamma_n(Q) = \rho^*(Q)$.

Example 9.29 (Continuation of Example 9.3; see page 537). Suppose that we
observe $X_1 = X_2 = 1$ and we are concerned with whether or not the optimal
rule stops at this point. We already saw that the optimal rule that takes at
most four observations stops at this point, but the optimal rule might continue.
The terminal risk is 0.064 (not counting cost of observations). The posterior is
$Beta(3, 1)$. Treating $Beta(3, 1)$ as the prior, we can compute $\rho_n$ and $\gamma_n$ for as
many $n$ as we desire. We get $\rho_n = 0.064$ for all $n$ and $\gamma_n = 0.064$ for $n > 33$.



Table 9.30. Two-Step Look-Ahead Rule for Example 9.26

i    x_i          rho_0       rho_2
0                 8.000000    2.866667
1    -0.129354    2.002092    1.201046
2    -2.158607    1.214599    0.928760
3     1.558454    0.935753    0.915571
4    -0.677818    0.606737    0.606737






This means that the optimal risk for continuing from this point is 0.064 and we 
should stop now. 

Suppose that we observe $X_1 = X_4 = 1$ and $X_2 = X_3 = 0$. The optimal rule
that takes at most four observations has to stop at this point, and the terminal
risk is 0.3174. The posterior is $Beta(3, 3)$. Treating this as the prior, we could
calculate $\rho_n$ and $\gamma_n$ for many $n$. At $n = 100$, they are both 0.2274. This means
that the optimal rule would continue sampling and that the optimal risk for
continuing (not counting cost of current observations) is 0.2274.



Just as hypothesis tests can be introduced as special cases of decision rules, 
sequential hypothesis tests are special cases of sequential decision rules. Just 
as sequential decision rules require a more general setup than fixed sample 
size rules, sequential hypothesis tests require a slightly more general setup 
than fixed sample size tests. 

Definition 9.31. Suppose that $X_i \in \mathcal{X}_i$ are random quantities for $i = 1, 2, \ldots$.
Let $X = (X_1, X_2, \ldots)$.$^5$ Let $\mathcal{P}_0$ be a parametric family of distributions
for $X$ with parameter space $\Omega$. Let $\Omega_H \cap \Omega_A = \emptyset$ and $\Omega_H \cup \Omega_A = \Omega$.
A sequential test of a hypothesis $H : \Theta \in \Omega_H$ versus $A : \Theta \in \Omega_A$ is a pair
of functions $(\phi, N)$ where $N$ is a stopping time and $\phi : \mathcal{X} \to [0, 1]$ gives the
conditional probability of rejecting $H$ given $X = x$.

Example 9.32. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with $N(\theta, 1)$ distribution
given $\Theta = \theta$. Let $\Omega_H = (-\infty, \theta_0]$ and $\Omega_A = (\theta_0, \infty)$. Let $\{v_n\}_{n=1}^\infty$ and $\{w_n\}_{n=1}^\infty$
be sequences of positive real numbers. The following is a sequential test of $H$
versus $A$:

$$N = \min\{n : \overline{x}_n - \theta_0 \notin (-w_n, v_n)\},$$

$$\phi(x) = \begin{cases} 1 & \text{if } N < \infty \text{ and } \overline{x}_N - \theta_0 \ge v_N, \\ 0 & \text{if } N < \infty \text{ and } \overline{x}_N - \theta_0 \le -w_N, \text{ or if } N = \infty, \end{cases}$$

where $\overline{x}_n$ is the average of the first $n$ coordinates of $x$.

5 Classical decision rules cannot stop at $N = 0$, because prior information is
not used. Hence, we have dispensed with the $X_0$ term in this setting.



9.2 The Sequential Probability Ratio Test

The Neyman-Pearson lemma 4.37 was the starting point from which
the theory of hypothesis testing originated. In sequential testing problems,
there is a similar starting point. We need to begin with a parameter space
consisting of only two points $\Omega = \{0, 1\}$. Suppose that $P_i$ has a density
$f_i$ with respect to some measure $\nu$ (such as $P_0 + P_1$). That is, $\{X_n\}_{n=1}^\infty$
are conditionally IID with density $f_i$ given $\Theta = i$. When we have observed
$X_1 = x_1, \ldots, X_n = x_n$, we will calculate the likelihood ratio

$$L_n(x) = \frac{\prod_{i=1}^n f_1(x_i)}{\prod_{i=1}^n f_0(x_i)},$$






which tells us how much more likely the data are under $P_1$ than under $P_0$.
The sequential probability ratio test [see Wald (1947)] SPRT$(B, A)$ is, for
each $n$, to reject $H : \Theta = 0$ if $L_n(x) \ge A$, accept $H$ if $L_n(x) \le B$, and to
continue sampling if $B < L_n(x) < A$, where $0 < B < 1 < A$. Another way
to write this is to let

$$N(x) = \inf\{n : L_n(x) \notin (B, A)\},$$

and reject $H$ if $L_{N(x)}(x) \ge A$, accept $H$ if $L_{N(x)}(x) \le B$.

It is clear that $\{x : N(x) = n\}$ is measurable with respect to the correct
$\sigma$-field. We would like to show that $N$ is finite, a.s.
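
The stopping rule just described can be sketched in a few lines. This is an illustrative implementation, not the book's: the function names, the returned labels, and the small tolerance `tol` (added only to guard against floating-point rounding when the log likelihood ratio lands exactly on a boundary) are assumptions. The example densities are Bernoulli with success probabilities 0.25 under $f_0$ and 0.75 under $f_1$.

```python
import math


def sprt(xs, log_lr, A, B, tol=1e-12):
    """Apply SPRT(B, A) to the observations xs.

    log_lr(x) must return log(f1(x) / f0(x)) for a single observation.
    Returns (decision, n): 'reject H' or 'accept H' with the stopping time n,
    or ('continue', len(xs)) if no boundary was reached with the data given.
    """
    a, b = math.log(A), math.log(B)
    s = 0.0  # running log likelihood ratio, log L_n(x)
    for n, x in enumerate(xs, start=1):
        s += log_lr(x)
        if s >= a - tol:
            return "reject H", n
        if s <= b + tol:
            return "accept H", n
    return "continue", len(xs)


def bernoulli_log_lr(x):
    """log(f1/f0) for f0 = Ber(0.25), f1 = Ber(0.75): +log 3 if x == 1, else -log 3."""
    return math.log(3) if x == 1 else -math.log(3)


print(sprt([0, 1, 1, 1, 0, 0], bernoulli_log_lr, A=9, B=1 / 9))
```

Working with $\log L_n$ rather than $L_n$ avoids overflow and underflow for long sequences, and matches the random-walk form $S_n = \sum Z_i$ used in the theory that follows.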

Theorem 9.33. Let $\{Z_n\}_{n=1}^\infty$ be IID with $\mathrm{Var}(Z_i) > 0$. Let $S_n = \sum_{i=1}^n Z_i$
and $N = \inf\{n : S_n \notin (b, a)\}$, where $b < a$. Then $\Pr(N < \infty) = 1$.

Proof. Let $c = |a| + |b|$. Choose $r$ large enough so that $r\,\mathrm{Var}(Z_i) > c^2$. For
each multiple of $r$, that is $n = rk$, write

$$\Xi_i = \sum_{j=(i-1)r+1}^{ir} Z_j, \qquad S_n = \Xi_1 + \cdots + \Xi_k.$$

If $|\Xi_m| > c$ for some $m$, then $N \le rm$ because $S_i$ would have to move
across one of the boundaries between $i = r(m - 1)$ and $i = rm$ if it has not
done so already, since $c$ is the distance between the boundaries. It follows
that

$$\{N = \infty\} \subseteq \{|\Xi_j| \le c, \text{ for all } j\}.$$

We know that $E\Xi_j^2 \ge r\,\mathrm{Var}(Z_i) > c^2$. From this it follows that $p = \Pr(|\Xi_j| > c) > 0$.
Since the $\Xi_j$ are IID,

$$\Pr(N = \infty) \le \Pr(|\Xi_i| \le c,\ i = 1, 2, \ldots) = \prod_{i=1}^\infty (1 - p) = 0. \qquad □$$

When we apply Theorem 9.33, we will let $Z_i = \log[f_1(X_i)/f_0(X_i)]$, $a = \log A$,
and $b = \log B$.

Theorem 9.34. If $\alpha = P_0(L_N \ge A)$ and $\beta = P_1(L_N \le B)$, then $\alpha \le (1 - \beta)/A$
and $\beta \le (1 - \alpha)B$.

Proof. Since $\{N = n\}$ is in the $\sigma$-field generated by $X_1, \ldots, X_n$, it follows
that

$$\alpha = \sum_{n=1}^\infty P_0(N = n, L_n \ge A)$$

$$= \sum_{n=1}^\infty \int_{\{N=n, L_n \ge A\}} \prod_{i=1}^n f_0(x_i)\,d\nu(x_1)\cdots d\nu(x_n)$$

$$= \sum_{n=1}^\infty \int_{\{N=n, L_n \ge A\}} \frac{1}{L_n} \prod_{i=1}^n f_1(x_i)\,d\nu(x_1)\cdots d\nu(x_n)$$

$$\le \frac{1}{A} \sum_{n=1}^\infty \int_{\{N=n, L_n \ge A\}} \prod_{i=1}^n f_1(x_i)\,d\nu(x_1)\cdots d\nu(x_n)$$

$$= \frac{1}{A} P_1(L_N \ge A) = \frac{1}{A}(1 - \beta).$$

Similarly, $\beta = P_1(L_N \le B) \le B P_0(L_N \le B) = B(1 - \alpha)$. □

If we ignore the overshoot of the boundaries, we can replace the inequalities
by equalities and solve the equations for

$$\alpha \approx \frac{1 - B}{A - B}, \qquad \beta \approx \frac{B(A - 1)}{A - B}.$$
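
As a quick numerical illustration, treating the inequalities of Theorem 9.34 as equalities, $\alpha = (1 - \beta)/A$ and $\beta = (1 - \alpha)B$, and solving for the error probabilities gives $\alpha = (1 - B)/(A - B)$ and $\beta = B(A - 1)/(A - B)$. The sketch below evaluates these and the inverse map; the function names are hypothetical, and the values are exact only when there is no overshoot.

```python
def error_probs_from_boundaries(A, B):
    """No-overshoot approximation: (alpha, beta) for SPRT(B, A) with 0 < B < 1 < A."""
    alpha = (1 - B) / (A - B)
    beta = B * (A - 1) / (A - B)
    return alpha, beta


def boundaries_from_error_probs(alpha, beta):
    """Inverse map: A = (1 - beta)/alpha and B = beta/(1 - alpha)."""
    return (1 - beta) / alpha, beta / (1 - alpha)


for A, B in [(3, 1 / 3), (9, 1 / 9), (9, 1 / 3), (27, 1 / 27)]:
    alpha, beta = error_probs_from_boundaries(A, B)
    print(A, round(B, 4), round(alpha, 3), round(beta, 3))
```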



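Theorem 9.35. Let $\alpha^*$ and $\beta^*$ be strictly between 0 and 1. The SPRT
with $A = (1 - \beta^*)/\alpha^*$ and $B = \beta^*/(1 - \alpha^*)$ has operating characteristics
$\alpha = P_0(L_N \ge A)$ and $\beta = P_1(L_N \le B)$, which satisfy $\alpha + \beta \le \alpha^* + \beta^*$.

Proof. If $\alpha \le \alpha^*$ and $\beta \le \beta^*$, the result is clearly true. So, suppose that
either $\beta > \beta^*$ or $\alpha > \alpha^*$. (We will see shortly that both inequalities cannot
occur simultaneously.) If $\beta > \beta^*$, then $1 - \beta < 1 - \beta^*$ and

$$\alpha \le \frac{1}{A}(1 - \beta) = \alpha^*\,\frac{1 - \beta}{1 - \beta^*} < \alpha^*.$$

It now follows that

$$\beta^* < \beta \le B(1 - \alpha) = \beta^*\,\frac{1 - \alpha}{1 - \alpha^*}.$$

Hence,

$$\beta - \beta^* \le \beta^*\,\frac{\alpha^* - \alpha}{1 - \alpha^*}.$$

It follows that

$$\alpha^* + \beta^* - \alpha - \beta = (\alpha^* - \alpha) + (\beta^* - \beta) \ge \alpha^* - \alpha - \beta^*\,\frac{\alpha^* - \alpha}{1 - \alpha^*} = (\alpha^* - \alpha)(1 - B) > 0.$$

Similarly, if $\alpha > \alpha^*$, we can show that $\beta^* > \beta$ and

$$\alpha^* + \beta^* - \alpha - \beta \ge (\beta^* - \beta)\left(1 - \frac{1}{A}\right) > 0. \qquad □$$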



Example 9.36. Suppose that $X_i \sim Ber(\theta)$ given $\Theta = \theta$. Suppose that $\theta \in \{0.25, 0.75\}$
and $H : \Theta = 0.25$. Then

$$Z_i = \log\frac{f_1(X_i)}{f_0(X_i)} = \begin{cases} -\log 3 & \text{if } X_i = 0, \\ \log 3 & \text{if } X_i = 1. \end{cases}$$

There will be no overshoot of the boundaries if $a = k_1 \log 3$ and $b = -k_2 \log 3$ for
$k_1$ and $k_2$ integers. Here are some examples:

k_1   k_2   alpha   beta    A     B
1     1     0.25    0.25    3     0.3333
2     2     0.1     0.1     9     0.1111
2     1     0.077   0.308   9     0.3333
1     2     0.308   0.077   3     0.1111
3     3     0.036   0.036   27    0.0370

Suppose that we choose the level 0.1 test with $k_1 = k_2 = 2$. We could calculate
the mean of $N$, the expected number of observations needed. It is clear that $N$
is even and that $N = 2k$ if and only if the first $2k - 2$ observations come in pairs
0, 1 or 1, 0 and the last two are 1, 1 or 0, 0. So,

$$P_0(N = 2k) = 0.625 \times 0.375^{k-1}, \qquad k = 1, 2, \ldots.$$

It follows that $E_0(N) = \sum_{k=1}^\infty 2k P_0(N = 2k) = 3.2$. It is easy to see that
$E_1(N) = 3.2$ also.

To compare this to a fixed sample size procedure, it takes $n = 6$ to have
$\alpha = \beta = 0.1035$ (not quite as good as the sequential procedure). The test has
test function

$$\phi_6(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^6 x_i \in \{4, 5, 6\}, \\ 0.5 & \text{if } \sum_{i=1}^6 x_i = 3, \\ 0 & \text{if } \sum_{i=1}^6 x_i \in \{0, 1, 2\}. \end{cases}$$
This test takes nearly twice as many observations and has higher error probabilities.
[...] the sequential procedure will need more observations. But this will only occur
for data sets in which $\phi_6$ randomizes. In fact, the two tests make almost all the
same decisions on six observations. The only disagreements come when two
1s are followed by four 0s or when two 0s are followed by four 1s, although $\phi_6$
[...] when the sequential procedure makes a terminal decision.
For example, if the first six observations are 0, 1, 1, 1, 0, 0, then the sequential
procedure would reject $H$ after four observations, but $\phi_6$ would randomize.

Suppose that a Bayesian believed that $\Pr(\Theta = 0.25) = 0.5$ before seeing any
data. Then, after $n$ observations with $x$ successes,

$$\Pr\!\left(\Theta = 0.25 \,\Big|\, \sum_{i=1}^n X_i = x\right) = \frac{0.5 \times 0.25^x\, 0.75^{n-x}}{0.5 \times 0.25^x\, 0.75^{n-x} + 0.5 \times 0.75^x\, 0.25^{n-x}} = (1 + 3^{2x-n})^{-1}.$$

For even $n$, $x = 1 + n/2$ leads to a posterior probability of $\Theta = 0.25$ equal to 0.1.
Similarly, $x = n/2 - 1$ leads to a posterior probability equal to 0.9. The SPRT
with $\alpha = \beta = 0.1$ turns out to be to reject $H$ as soon as the posterior probability
that $H$ is true falls to 0.1 and to accept $H$ as soon as it rises to 0.9, if we have
equal prior probabilities to start.

An interesting calculation can be done in Example 9.36. We found that
$E_0(N) = 3.2$. Notice also that

$$E_0(S_N) = 0.1 \times 2\log 3 + 0.9 \times (-2\log 3) = -1.6\log 3.$$

Since $E_0(Z_i) = 0.25\log 3 - 0.75\log 3 = -0.5\log 3$, we see that $E_0(S_N) = E_0(Z_i)E_0(N)$.
It is as if $N$ were fixed in advance!
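
These numbers can be checked exactly, without simulation, by propagating the distribution of the random walk until it is absorbed. The sketch below is illustrative: it tracks $S_n/\log 3$, which moves by $+1$ with probability $\theta$ and $-1$ otherwise, and stops at $+k_1$ or $-k_2$. The function name and the truncation point `n_max` are assumptions; the probability mass still unabsorbed after `n_max` steps is negligible here.

```python
import math


def stopped_walk(p_up, k_up, k_down, n_max=2000):
    """Integer walk from 0 (+1 w.p. p_up, else -1), stopped at k_up or -k_down.

    Returns (P(stop at k_up), E[N]), accumulated over the first n_max steps.
    """
    alive = {0: 1.0}   # distribution of the walk over not-yet-stopped paths
    p_top = 0.0
    mean_n = 0.0
    for n in range(1, n_max + 1):
        nxt = {}
        for pos, mass in alive.items():
            for step, prob in ((1, p_up), (-1, 1.0 - p_up)):
                new, m = pos + step, mass * prob
                if new >= k_up:          # upper boundary: reject H
                    p_top += m
                    mean_n += n * m
                elif new <= -k_down:     # lower boundary: accept H
                    mean_n += n * m
                else:
                    nxt[new] = nxt.get(new, 0.0) + m
        alive = nxt
    return p_top, mean_n


alpha, mean_n = stopped_walk(p_up=0.25, k_up=2, k_down=2)
e_sn = math.log(3) * (2 * alpha - 2 * (1 - alpha))   # E_0(S_N), no overshoot
e_z = 0.25 * math.log(3) - 0.75 * math.log(3)        # E_0(Z_i)
print(alpha, mean_n, e_sn, e_z * mean_n)
```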

Theorem 9.37 (Wald's Lemma). Let $\{Z_n\}_{n=1}^\infty$ be IID such that $E(Z_i)$
exists. Let $N$ be a stopping time such that $E(N) < \infty$. If $S_n = \sum_{i=1}^n Z_i$,
then $E(S_N) = E(Z_i)E(N)$.

Proof. We can write $S_N = \sum_{n=1}^\infty Z_n I_{\{n,n+1,\ldots\}}(N)$. Now write

$$E(S_N) = E\!\left(\sum_{n=1}^\infty Z_n I_{\{n,n+1,\ldots\}}(N)\right)$$

$$= E\!\left(\sum_{n=1}^\infty Z_n^+ I_{\{n,n+1,\ldots\}}(N)\right) - E\!\left(\sum_{n=1}^\infty Z_n^- I_{\{n,n+1,\ldots\}}(N)\right)$$

$$= \sum_{n=1}^\infty E\!\left(Z_n I_{\{n,n+1,\ldots\}}(N)\right).$$

Since $I_{\{n,n+1,\ldots\}}(N) = 1 - I_{\{0,1,\ldots,n-1\}}(N)$ is a function of $Z_1, \ldots, Z_{n-1}$, it is
independent of $Z_n$. Hence

$$E(Z_n I_{\{n,n+1,\ldots\}}(N)) = E(Z_n)\Pr(N \ge n) = E(Z_1)\Pr(N \ge n).$$

It follows that $E(S_N) = \sum_{n=1}^\infty E(Z_1)\Pr(N \ge n) = E(Z_1)E(N)$. □

Wald's lemma can be used to help approximate the expected value of
$N$ under distributions other than the hypothesis and alternative. If we
approximate by assuming that there is no overshoot, then

$$S_N \approx \begin{cases} a & \text{if reject } H, \\ b & \text{if accept } H. \end{cases}$$

All we need to complete the approximation is $\Pr(\text{reject } H)$.

Lemma 9.38. If $\{X_n\}_{n=1}^\infty$ are IID with distribution $P$, and there exists
$h \ne 0$ such that

$$\int \left[\frac{f_1(x)}{f_0(x)}\right]^h dP(x) = 1,$$

then, to the approximation of no overshoot, for the SPRT$(B, A)$,

$$P(\text{reject } H) \approx \frac{1 - B^h}{A^h - B^h}.$$

Proof. If $h > 0$, consider the SPRT$(B^h, A^h)$ as a test of the hypothesis
that the density of each observation (with respect to $P$) is 1 versus the
alternative that the density is $(f_1/f_0)^h$. The likelihood ratio is $L_n^* = L_n^h$
and $B^h < L_n^* < A^h$ if and only if the original likelihood ratio satisfies
$B < L_n < A$. So

$$P(\text{reject } H) = P(L_N \ge A) = P(L_N^* \ge A^h) \approx \frac{1 - B^h}{A^h - B^h}.$$

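If $h < 0$, consider the SPRT$(A^h, B^h)$, with lower boundary $B^* = A^h$ and upper
boundary $A^* = B^h$, as a test of $H^*$ that the density is 1 versus the alternative
that the density is $(f_1/f_0)^h$. Then the likelihood ratio is $L_n^* = L_n^h$ and

$$P(\text{reject } H) = P(L_N \ge A) = P(L_N^* \le B^*) \approx 1 - \frac{1 - A^h}{B^h - A^h} = \frac{B^h - 1}{B^h - A^h} = \frac{1 - B^h}{A^h - B^h}. \qquad □$$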



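Using the no-overshoot approximation,

$$E(S_N) = aP(\text{reject } H) + bP(\text{accept } H) = b + (a - b)P(\text{reject } H).$$

So, if $E(Z_i) \ne 0$, we get

$$E(N) \approx \frac{\log(B) + \log\!\left(\frac{A}{B}\right)\frac{1 - B^h}{A^h - B^h}}{E(Z_i)}.$$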

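Example 9.39 (Continuation of Example 9.36; see page 551). Suppose that
$\{X_n\}_{n=1}^\infty$ are IID $Ber(0.6)$, but we are testing $H : \Theta = 0.25$ versus $A : \Theta = 0.75$.
We have

$$\frac{f_1(x)}{f_0(x)} = \begin{cases} \frac{1}{3} & \text{if } x = 0, \\ 3 & \text{if } x = 1. \end{cases}$$

If $h = -\log_3(1.5) = -0.36907$, then $0.4(1/3)^h + 0.6 \times 3^h = 1$. We can now
calculate, for the level 0.1 test with $A = 9$ and $B = 1/9$,

$$E_{0.6}(S_N) \approx \log\!\left(\frac{1}{9}\right) + \log(81)\,\frac{1 - \left(\frac{1}{9}\right)^{-0.36907}}{9^{-0.36907} - \left(\frac{1}{9}\right)^{-0.36907}} = 0.8451,$$

$$E_{0.6}(Z_i) = 0.6 \times \log(3) - 0.4 \times \log(3) = 0.2197,$$

so that $E_{0.6}(N) \approx 0.8451/0.2197 \approx 3.85$.
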




Notice that the mean stopping time is longer when $\theta$ is between the hypothesis
and the alternative.

If $\{X_n\}_{n=1}^\infty$ are IID $Ber(0.5)$, then $h = 0$ is the only value that works in the
equation in Lemma 9.38. Hence, Lemma 9.38 has nothing to say about this case.
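
The calculation for the $Ber(0.6)$ case can be reproduced numerically. The sketch below is illustrative only: it finds the nonzero root $h$ of $0.4(1/3)^h + 0.6 \times 3^h = 1$ by bisection (the bracket $[-5, -10^{-6}]$ is an assumption chosen for this example), then applies the no-overshoot formulas with $A = 9$ and $B = 1/9$.

```python
import math

THETA = 0.6          # true success probability
A, B = 9.0, 1.0 / 9  # boundaries of the level 0.1 test (k_1 = k_2 = 2)


def g(h):
    """E[(f1(X)/f0(X))^h] - 1 when X ~ Ber(THETA) and f1/f0 is 3 or 1/3."""
    return (1 - THETA) * (1 / 3) ** h + THETA * 3 ** h - 1


# Bisection for the nonzero root; g > 0 at -5 and g < 0 just below 0.
lo, hi = -5.0, -1e-6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid
h = 0.5 * (lo + hi)

p_reject = (1 - B ** h) / (A ** h - B ** h)
e_sn = math.log(B) + math.log(A / B) * p_reject
e_z = THETA * math.log(3) - (1 - THETA) * math.log(3)
e_n = e_sn / e_z

print(round(h, 5), round(p_reject, 4), round(e_sn, 4), round(e_z, 4), round(e_n, 3))
```

Because this walk moves in steps of exactly $\pm\log 3$ and the boundaries are multiples of $\log 3$, there is no overshoot, so the approximation is exact in this particular case.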

The following result has a proof similar to that of Wald's lemma, but 
applies to the case not yet handled in Example 9.36. 

Proposition 9.40. Suppose that $\{Z_n\}_{n=1}^\infty$ are IID, with $E(Z_i) = 0$ and
$E(Z_i^2) = \sigma^2$. Suppose that $N$ is a stopping time such that $E(N) < \infty$. Then
$E(S_N^2) = \sigma^2 E(N)$.

Example 9.41 (Continuation of Example 9.36; see page 551). If $\{X_n\}_{n=1}^\infty$ are
IID $Ber(0.5)$, then $E(Z_i) = 0$ and $E(Z_i^2) = (\log(3))^2 = 1.2069$. Also, $E(S_N^2) =
4(\log(3))^2 = 4.8278$. It follows that $E(N) = 4$. Of course, this example is simple
enough that we could calculate $E_\theta(N)$ for all $\theta$ without any of these theorems.
See Problem 9 on page 568.

The SPRT has an optimal property in terms of expected sample size 
which follows from its being a Bayes rule in a sequential decision problem. 
This is very much like the Neyman-Pearson fundamental lemma 3.87 in 
which a minimal complete class of Bayes rules was found in the fixed sample 
size problem for a simple hypothesis and a simple alternative. 

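Lemma 9.42. Suppose that $0 < \gamma_1 < \gamma_2 < 1$ and that $f_0$ and $f_1$ are two
different densities with respect to a measure $\nu$. There exist $0 < w < 1$ and
$c > 0$, such that for every $\gamma \in [\gamma_1, \gamma_2]$, the SPRT$(B, A)$ with

$$B = \frac{\gamma}{1 - \gamma}\,\frac{1 - \gamma_2}{\gamma_2}, \qquad A = \frac{\gamma}{1 - \gamma}\,\frac{1 - \gamma_1}{\gamma_1}$$

is a Bayes rule in the sequential decision problem with action space $\aleph =
\{0, 1\} \times \{1, 2, \ldots\}$, parameter space $\{0, 1\}$, prior distribution $\Pr(\Theta = 0) = \gamma$,
and loss function

$$L(i, (j, n)) = cn + \begin{cases} w_i & \text{if } i \ne j, \\ 0 & \text{otherwise,} \end{cases}$$

where $w_0 = 1 - w$ and $w_1 = w$.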

Proof. First we will find the general solution of the sequential decision
problem, and then we will show that there is one whose solution is
SPRT$(B, A)$. To put the problem in testing form, let $\Omega_H = \{f_0\}$. Suppose
that a sequential test is $\delta = (\phi, N)$. Define

$$\alpha_0(\delta) = E_0(\phi(X)), \qquad \alpha_1(\delta) = 1 - E_1(\phi(X)).$$

Then the Bayes risk of $\delta$ with respect to prior probability $\gamma = \Pr(\Theta = 0)$
is

$$\rho(\gamma, \delta) = \gamma(w_0\alpha_0(\delta) + cE_0[N(X)]) + (1 - \gamma)(w_1\alpha_1(\delta) + cE_1[N(X)]).$$

Define, for each $0 \le \gamma \le 1$,

$$U(\gamma) = \inf_\delta \rho(\gamma, \delta).$$

Since $N(x) \ge 1$ for all $x$, it follows that $U(\gamma) > 0$ for all $\gamma$. Since $\rho(\gamma, \delta)$ is
a positive linear function of $\gamma$ for each $\delta$, it follows that $U$ is the infimum of
a collection of positive linear functions; hence, it is concave and continuous
on $(0, 1)$ and positive at the two endpoints. Define

$$f_{i,n}(x) = \prod_{j=1}^n f_i(x_j),$$

for $i = 0, 1$ and $n = 1, 2, \ldots$, where $x = (x_1, x_2, \ldots)$. The posterior probability
of $\{\Theta = 0\}$ given $X_1 = x_1, \ldots, X_n = x_n$ is

$$\gamma_n(x) = \frac{\gamma f_{0,n}(x)}{\gamma f_{0,n}(x) + (1 - \gamma) f_{1,n}(x)} = \frac{1}{1 + \frac{1 - \gamma}{\gamma} L_n(x)},$$

where $L_n(x)$ is the likelihood ratio after $n$ observations (used in every
SPRT). After observing $X_1 = x_1, \ldots, X_n = x_n$, the posterior mean of the
loss to be incurred if $N = n$ is $h(\gamma_n(x)) + cn$, where

$$h(\gamma) = \min\{w_0\gamma,\ w_1(1 - \gamma)\}.$$

The posterior mean of the loss to be incurred if $N > n$ is at least $U(\gamma_n(x)) + cn$.
Hence, the Bayes rule is to continue sampling so long as $h(\gamma_n(x)) >
U(\gamma_n(x))$ and to stop at $N(x)$ equal to the first $n$ such that $h(\gamma_n(x)) \le
U(\gamma_n(x))$. Note that $h(\gamma)$ is continuous, has a graph shaped like a triangle,
and satisfies $h(0) = h(1) = 0$. Since $U$ is concave, it follows that $h(\gamma) >
U(\gamma)$ for $\gamma$ in some interval $(g_1, g_2)$. Figure 9.43 shows the $U$ and $h$ functions
for a typical example.$^6$ (If $h(\gamma) \le U(\gamma)$ for all $\gamma$, define $g_1 = g_2$ to be
the value of $\gamma$ at which $h$ is maximized.) Hence, the Bayes rule continues
sampling so long as

$$B^* = \frac{\gamma}{1 - \gamma}\,\frac{1 - g_2}{g_2} < L_n(x) < \frac{\gamma}{1 - \gamma}\,\frac{1 - g_1}{g_1} = A^*;$$

it rejects $H$ if $L_n(x) \ge A^*$, and it accepts $H$ if $L_n(x) \le B^*$. Therefore,
the Bayes rule is SPRT$(B^*, A^*)$.

The $g_1$ and $g_2$ found above depend on the particular decision problem
only through $w$ and $c$, where we assume that $w_1 = w$ and $w_0 = 1 - w$. This
is true because the functions $h$ and $U$ depend on the decision problem only


6 The example in Figure 9.43 has $f_0$ being the $Ber(0.6)$ distribution and $f_1$
being the $Ber(0.8)$ distribution. Also, $w_0 = 0.4$ and $c = 0.02$. In this example,
$g_1 = 0.31$ and $g_2 = 0.48$.



Figure 9.43. Typical U and h Functions for SPRT (horizontal axis: Prior Probability)



through these values. So we call the two values $g_1(w, c)$ and $g_2(w, c)$. To
finish the proof, we need only find $c$ and $w$ so that $\gamma_i = g_i(w, c)$ for both
$i = 1, 2$. Define

$$\beta(w, c) = \frac{g_2(w, c)}{1 - g_2(w, c)}, \qquad \lambda(w, c) = \frac{g_1(w, c)}{\beta(w, c)(1 - g_1(w, c))}.$$

It is easy to see that $g_1$ and $g_2$ are also functions of $\beta$ and $\lambda$. Set

$$\lambda_0 = \frac{\gamma_1}{1 - \gamma_1}\,\frac{1 - \gamma_2}{\gamma_2}, \qquad \beta_0 = \frac{\gamma_2}{1 - \gamma_2}.$$

Then, we need to find $w$ and $c$ so that $\lambda(w, c) = \lambda_0$ and $\beta(w, c) = \beta_0$.
As $c \downarrow 0$, $U(0)$ and $U(1)$ both approach 0, hence $g_1(w, c)$ tends to 0 and
$g_2(w, c)$ tends to 1 as $c \downarrow 0$ for fixed $w$. Hence, for fixed $w$, $\lim_{c \downarrow 0} \lambda(w, c) = 0$.
Since $\rho(\gamma, \delta)$ increases as $c$ increases for every $\gamma$ and $\delta$, it follows that $U(\gamma)$
increases as $c$ increases for every $\gamma$. Hence the interval on which $h > U$ shrinks
as $c$ increases: $g_1(w, c)$ is an increasing function of $c$ and $g_2(w, c)$ is decreasing
in $c$ for fixed $w$. It follows that $\lambda$ is strictly
increasing in $c$ for fixed $w$. As $c \to \infty$, eventually, $h(\gamma) \le U(\gamma)$ for all $\gamma$.
Let $c_0(w) = \inf\{c : g_1(w, c) = g_2(w, c)\}$. Then $\lim_{c \uparrow c_0(w)} \lambda(w, c) = 1$. Since
$0 < \lambda_0 < 1$, there exists a unique $c = c(w)$ such that $\lambda(w, c) = \lambda_0$. As
$w$ approaches 0 for fixed $c$, $U$ approaches the constant function $c$ and $h$
approaches the constant function 0 while the peak of the triangle in the
graph of $h$ moves toward $\gamma = 0$. Hence $g_1(w, c)$ and $g_2(w, c)$ approach 0.
Also $c_0(w)$ and $c(w)$ approach 0 as $w$ approaches 0. Hence

$$\lim_{w \to 0} g_2(w, c(w)) = 0, \qquad \lim_{w \to 0} \beta(w, c(w)) = 0.$$

As $w$ approaches 1 for fixed $c$, $U$ approaches the constant function $c$ and
$h$ approaches the constant function 0 while the peak of the triangle in the
graph of $h$ moves toward $\gamma = 1$. Hence $g_1(w, c)$ and $g_2(w, c)$ approach 1.
Also $c_0(w)$ and $c(w)$ approach 0 as $w$ approaches 1. Hence

$$\lim_{w \to 1} g_2(w, c(w)) = 1, \qquad \lim_{w \to 1} \beta(w, c(w)) = \infty.$$

Since $\beta(w, c(w))$ is continuous in $w$, there exists $w$ such that $\beta(w, c(w)) =
\beta_0$. □

We are now in position to prove a theorem of Wald and Wolfowitz (1948) 
that says that the SPRT has the smallest expected sample size of all tests 
with the same error probabilities. Since fixed sample size tests can be con- 
sidered as sequential tests in which N is not a function of x, we state the 
theorem in terms of sequential tests only. 

Theorem 9.44. Let $\Omega_H = \{f_0\}$ and $\Omega_A = \{f_1\}$. Let $B_0 < 1 < A_0$.
Let $\delta_0 = (\phi_0, N_0)$ be SPRT$(B_0, A_0)$, and suppose that $E_0(\phi_0(X)) = \alpha$
and $1 - E_1(\phi_0(X)) = \beta$. Among all sequential tests $\delta = (\phi, N)$ for which
$N \ge 1$, $E_0(\phi(X)) \le \alpha$, $1 - E_1(\phi(X)) \le \beta$, and $E_i(N) < \infty$ for $i = 0, 1$, $\delta_0$
minimizes $E_i(N)$ for $i = 0, 1$.

Proof. Let $\delta = (\phi, N)$ be a sequential test of $H$ versus $A$ with $E_0(\phi(X)) =
\alpha_1 \le \alpha$, $1 - E_1(\phi(X)) = \beta_1 \le \beta$, and $E_i(N) < \infty$ for $i = 0, 1$. Pick $0 < \gamma < 1$
and define

$$\gamma_1 = \frac{\gamma}{A_0(1 - \gamma) + \gamma}, \qquad \gamma_2 = \frac{\gamma}{B_0(1 - \gamma) + \gamma}.$$

Then $0 < \gamma_1 < \gamma < \gamma_2 < 1$ and

$$B_0 = \frac{\gamma}{1 - \gamma}\,\frac{1 - \gamma_2}{\gamma_2}, \qquad A_0 = \frac{\gamma}{1 - \gamma}\,\frac{1 - \gamma_1}{\gamma_1}.$$

Lemma 9.42 says that there exist $0 < w < 1$ and $c > 0$ such that $\delta_0$
is a Bayes rule in the sequential decision problem with action space $\aleph =
\{0, 1\} \times \{1, 2, \ldots\}$, parameter space $\{0, 1\}$, prior distribution $\Pr(\Theta = 0) = \gamma$,
and loss function

$$L(i, (j, n)) = cn + \begin{cases} w_i & \text{if } i \ne j, \\ 0 & \text{otherwise,} \end{cases}$$

where $w_0 = 1 - w$ and $w_1 = w$. It follows that

$$\gamma(w_0\alpha + cE_0N_0) + (1 - \gamma)(w_1\beta + cE_1N_0) \le \gamma(w_0\alpha_1 + cE_0N) + (1 - \gamma)(w_1\beta_1 + cE_1N).$$

Since $\alpha_1 \le \alpha$, $\beta_1 \le \beta$, $c > 0$, $w_0 > 0$, and $w_1 > 0$, it follows that

$$\gamma(E_0N_0 - E_0N) + (1 - \gamma)(E_1N_0 - E_1N) \le 0.$$

Since this is true for every $\gamma \in (0, 1)$, the limit as $\gamma$ goes to 1 or to 0
of the left-hand side is also less than or equal to 0. These two limits are
respectively $E_0N_0 - E_0N$ and $E_1N_0 - E_1N$. □
The reason for the extra conditions that $E_i(N) < \infty$ for $i = 0, 1$ is that
there may be another test with $E_0(N) = \infty$ and very small value of $E_1(N)$.



9.3 Interval Estimation* 

In Section 5.2.5, we gave examples of loss functions associated with interval 
estimation. For example, let the (terminal) action space be the set of all 
pairs of real numbers (a, b) with a < b. We could set 

$$L'(\theta, (a, b)) = k(b - a) + 1 - I_{[a,b]}(g(\theta)),$$

or

$$L'(\theta, (a, b)) = k(b - a)^2 + 1 - I_{[a,b]}(g(\theta)),$$

or

$$L'(\theta, (a, b)) = k(b - a) + \begin{cases} a - g(\theta) & \text{if } g(\theta) < a, \\ g(\theta) - b & \text{if } g(\theta) > b, \\ 0 & \text{otherwise,} \end{cases}$$

where $k > 0$. The first two loss functions above lead to intervals with equal
posterior density at $a$ and $b$. The third one leads to intervals with equal
posterior probability below $a$ and above $b$. Another alternative is to let
$\aleph' = \mathbb{R}$ and set

$$L'(\theta, a) = 1 - I_{[a-d,a+d]}(g(\theta)),$$

where $d > 0$ is some fixed half-width for the interval. In this case, the
interval has a fixed width and the coverage probability is determined (along
with the center of the interval) by the data.

Example 9.45. Suppose that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$. Let
$g(\theta) = \mu$. Suppose that we want an interval of half-width $d$ and the cost of each
observation is $c$. Suppose that the prior is conjugate. The posterior distribution
of $M$ after $n$ observations will be $t_{a_n}(\mu_n, b_n/[a_n\lambda_n])$, where the posterior
hyperparameters $a_n$, $b_n$, $\mu_n$, and $\lambda_n$ are as in Example 9.19 on page 544. The optimal
decision upon stopping is $\mu_n$. The terminal risk (not counting cost of observation)
is

$$\rho_0(a_n, b_n, \mu_n, \lambda_n) = 2\left[1 - T_{a_n}\!\left(d\sqrt{\frac{a_n\lambda_n}{b_n}}\right)\right],$$




*This section may be skipped without interrupting the flow of ideas. 






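where $T_{a_n}$ stands for the CDF of the $t_{a_n}(0, 1)$ distribution. To implement the
one-step look-ahead rule, we need to calculate

$$\rho_1(a_n, b_n, \mu_n, \lambda_n) = \min\left\{2\left[1 - T_{a_n}\!\left(d\sqrt{\frac{a_n\lambda_n}{b_n}}\right)\right],\ c + 2\left[1 - E\,T_{a_n+1}\!\left(d\sqrt{\frac{(a_n + 1)(\lambda_n + 1)}{b_n(1 + Y^2)}}\right)\right]\right\},$$

where

$$Y = \sqrt{\frac{\lambda_n}{\lambda_n + 1}}\,\frac{X_{n+1} - \mu_n}{\sqrt{b_n}} \sim t_{a_n}\!\left(0, \frac{1}{a_n}\right).$$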

Even this formula requires numerical integration to compute. For example, suppose
that we use a prior with $a_0 = 2$, $b_0 = 7$, $\mu_0 = 1$, and $\lambda_0 = 1$. The cost of each
observation is 0.005 and the half-width of the interval is $d = 1.96$. Table 9.47
contains some data along with $\rho_0$ and $\rho_1$. The terminal decision is $\mu_9 = -1.011531$,
and so the interval is $[-2.9715, 0.9485]$. The posterior probability in the interval
is 0.9810. The probability is so high because the cost of each observation is so
low. If the cost had been 0.01 instead, the one-step look-ahead rule would have
stopped after seven observations with the interval $[-3.0214, 0.8986]$ and posterior
probability 0.9640.
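
The terminal risk in this example, $2[1 - T_{a}(d\sqrt{a\lambda/b})]$, needs only the $t$ CDF. The sketch below is illustrative: it evaluates that CDF with Simpson's rule using only the standard library (the function names and the number of integration steps are assumptions), and applies it to the prior $a_0 = 2$, $b_0 = 7$, $\lambda_0 = 1$ with $d = 1.96$, and again after the first observation $x_1 = -1.745003$ of Table 9.47 using the update formulas of Example 9.19. The expectation inside $\rho_1$ would need a further numerical integration over $Y$ and is not attempted here.

```python
import math


def t_cdf(x, df, steps=2000):
    """CDF at x >= 0 of the standard t distribution with df degrees of freedom.

    Uses Simpson's rule on [0, x]; steps must be even.
    """
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(y):
        return const * (1 + y * y / df) ** (-(df + 1) / 2)

    width = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return 0.5 + total * width / 3


def terminal_risk(a, b, lam, d):
    """Posterior probability that M lies outside [mu - d, mu + d]."""
    return 2 * (1 - t_cdf(d * math.sqrt(a * lam / b), a))


d = 1.96
risk_prior = terminal_risk(2, 7.0, 1.0, d)

# One observation: a -> a + 1, lambda -> lambda + 1, b -> b + lambda/(lambda+1) (x - mu)^2
x1, mu0, lam0 = -1.745003, 1.0, 1.0
b1 = 7.0 + lam0 / (lam0 + 1) * (x1 - mu0) ** 2
risk_after_one = terminal_risk(3, b1, lam0 + 1, d)

print(round(risk_prior, 7), round(risk_after_one, 7))
```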

The classical approach to fixed-width interval estimation has tradition- 
ally not been through a loss function. Rather, one requires that sampling 
continue until one has an interval of a fixed width with the desired confi- 
dence coefficient or greater. No cost of observation is taken into account, 
except possibly for comparing different procedures. 

The most naive procedure would be to compute a fixed sample size co- 
efficient 7 confidence interval for each n, and stop at the first n such that 
the interval has half-width at most d. 

Definition 9.46. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID $N(\mu, \sigma^2)$ given $\Theta =
(\mu, \sigma)$. Let $\overline{X}_n = \sum_{i=1}^n X_i/n$ and $S_n^2 = \sum_{i=1}^n (X_i - \overline{X}_n)^2/(n - 1)$ for each
$n$. Let $N_1$ be the smallest $n \ge 2$ such that $S_n T_{n-1}^{-1}(1 - \alpha/2) \le \sqrt{n}\,d$,



Table 9.47. One-Step Look-Ahead Rule for Example 9.45

i    x_i          rho_0        rho_1
0                 0.4047363    0.2816774
1    -1.745003    0.2396304    0.1774288
2     0.793758    0.1178335    0.0881298
3    -4.385832    0.1475386    0.1190706
4    -4.708225    0.1268197    0.1053043
5    -0.363233    0.0797290    0.0675238
6     1.162189    0.0599002    0.0521162
7    -0.244931    0.0360019    0.0330767
8     1.455418    0.0265989    0.0258814
9    -3.079450    0.0189839    0.0189839






where $T_{n-1}$ is the CDF of the $t_{n-1}(0, 1)$ distribution. Then the interval
$[\overline{X}_{N_1} - d, \overline{X}_{N_1} + d]$ is called the naive coefficient $1 - \alpha$ sequential confidence
interval for $M$.

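It is not surprising that the naive confidence interval does not have coverage
probability $1 - \alpha$. That is, $P_\theta(\overline{X}_{N_1} - d \le \mu \le \overline{X}_{N_1} + d) \ne 1 - \alpha$. We
can write an expression for the coverage probability, however:

$$P_\theta(\overline{X}_{N_1} - d \le \mu \le \overline{X}_{N_1} + d) = \sum_{n=2}^\infty P_\theta(\overline{X}_n - d \le \mu \le \overline{X}_n + d \mid N_1 = n) P_\theta(N_1 = n). \qquad (9.48)$$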

First, recall that $N_1$ is the first $n \ge 2$ such that $S_n^2 \le k_n$ for some sequence
$\{k_n\}_{n=2}^\infty$. We will prove that the event $\{N_1 = n\}$ is independent of $\bar{X}_n$.
This will allow some simplification in (9.48). It is sufficient to show that
$S_2, \ldots, S_n$ are all independent of $\bar{X}_n$. For each $k = 1, 2, \ldots$, consider the
$k \times k$ matrix $\Gamma_k$ whose rows are all unit vectors and whose $i$th row (for
$i < k$) is proportional to the vector with 1 in the first $i$ places and $-i$ in the
$(i + 1)$st place. The $k$th row is proportional to the vector of all 1s. It is easy
to see that these rows are orthogonal to each other, so $\Gamma_k$ is orthogonal
for each $k$. For $i < n$, define $W_i$ to be the inner product of the $i$th row
of $\Gamma_n$ with $(X_1, \ldots, X_n)$.$^7$ Note that the inner product of the last row of
$\Gamma_k$ and the vector $X^k = (X_1, \ldots, X_k)$ is $\sqrt{k}\,\bar{X}_k$. So, $\bar{X}_n$ is independent
of $W_1, \ldots, W_{n-1}$. Also, since $\|\Gamma_k X^k\| = \|X^k\|$, it follows that $S_k^2 =
\sum_{i=1}^{k-1} W_i^2/(k - 1)$ for $k = 2, 3, \ldots$. Hence $\bar{X}_n$ is independent of $S_2, \ldots, S_n$. This
means that we can write



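$$P_\theta(\bar{X}_n - d < \mu < \bar{X}_n + d \mid N_1 = n) = P_\theta(\bar{X}_n - d < \mu < \bar{X}_n + d) = 2\Phi\left(\frac{\sqrt{n}\,d}{\sigma}\right) - 1.$$

Hence,

$$P_\theta(\bar{X}_{N_1} - d < \mu < \bar{X}_{N_1} + d) = \sum_{n=2}^\infty \left[2\Phi\left(\frac{\sqrt{n}\,d}{\sigma}\right) - 1\right] P_\theta(N_1 = n).$$

Some numerical method would be required to calculate $P_\theta(N_1 = n)$, but
the argument above can be used to show that it depends on $\theta$ only through
$d/\sigma$.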
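The orthogonal-matrix argument above can be checked numerically. The following is a minimal sketch in Python, not part of the original text: the function names are hypothetical, and the five data values are simply the first five $X_i$ of Table 9.47, used as arbitrary numbers. It builds the matrix $\Gamma_n$ as described, confirms that its rows are orthonormal, and confirms that $S_k^2 = \sum_{i=1}^{k-1} W_i^2/(k-1)$ for every $k \le n$ using one fixed set of $W_i$.

```python
import math

def helmert_matrix(k):
    """Return the k x k matrix described in the text as a list of rows.

    Row i (1-based, i < k) is proportional to (1, ..., 1, -i, 0, ..., 0)
    with i leading ones; row k is proportional to the all-ones vector.
    """
    rows = []
    for i in range(1, k):
        raw = [1.0] * i + [-float(i)] + [0.0] * (k - i - 1)
        norm = math.sqrt(i + i * i)          # length of (1, ..., 1, -i)
        rows.append([v / norm for v in raw])
    rows.append([1.0 / math.sqrt(k)] * k)    # last row: all ones, normalized
    return rows

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_variance(x):
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)

# Arbitrary illustrative data (the first five X_i listed in Table 9.47).
x = [-1.745003, 0.793758, -4.385832, -4.708225, -0.363233]
n = len(x)
gamma_n = helmert_matrix(n)
w = [dot(row, x) for row in gamma_n[:-1]]    # W_1, ..., W_{n-1}

# Largest deviation of Gamma_n Gamma_n^T from the identity matrix.
max_dev = max(abs(dot(gamma_n[i], gamma_n[j]) - (1.0 if i == j else 0.0))
              for i in range(n) for j in range(n))

# Pairs (S_k^2, sum_{i<k} W_i^2 / (k - 1)) for k = 2, ..., n.
pairs = [(sample_variance(x[:k]), sum(wi ** 2 for wi in w[:k - 1]) / (k - 1))
         for k in range(2, n + 1)]

# Inner product of the last row with x; should equal sqrt(n) times the mean.
last_row_value = dot(gamma_n[-1], x)
```

Because the first $k - 1$ entries of `w` serve for every $k$, the sketch also illustrates the point of footnote 7: $W_i$ does not depend on which $n > i$ is used to define it.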

It should be noted that a Bayesian does not have the same problem with 
naive posterior probability intervals. So long as the decision of whether or 
not to take more data is a measurable function solely of the available data, 



7 Note that, for fixed $i$, $W_i$ is the same no matter which $n > i$ is chosen for its
definition.



9.3. Interval Estimation 561 



the posterior distribution of a parameter given the data does not depend 
on whether or not any more data will be taken. Hence, the Bayesian can 
declare, after every observation, what is the posterior probability that the 
parameter lies in any set he or she wishes. If, after n observations, there is 
an interval of half-width $d$ such that the posterior probability is $1 - \alpha$ that
$M$ is in that interval, the Bayesian can declare that and stop sampling.

There are some classical procedures that do actually have the desired coverage
probability. We will present one such procedure here. It is a two-stage
sampling procedure. One chooses an initial sample size $n_0$ and estimates
$\Sigma^2$ by $S_{n_0}^2$. Then one collects a second sample whose size depends on $S_{n_0}^2$.
Define

$$c = \frac{d^2}{\left[T_{n_0-1}^{-1}\left(1 - \frac{\alpha}{2}\right)\right]^2}, \qquad N_2 = \max\left\{n_0, \left\lfloor \frac{S_{n_0}^2}{c} \right\rfloor + 1\right\},$$

where $\lfloor z \rfloor$ denotes the largest integer less than or equal to $z$. Use the interval
centered at $\bar{X}_{N_2}$ with half-width $d$.

Lemma 9.49. With the above notation, the conditional distribution of

$$\frac{\sqrt{N_2}\,(\bar{X}_{N_2} - \mu)}{S_{n_0}} \tag{9.50}$$

given $\Theta = (\mu, \sigma)$ is $t_{n_0-1}(0, 1)$.
Proof. We can write

$$\bar{X}_{N_2} = \frac{1}{N_2}\left(\sum_{i=1}^{n_0} X_i + \sum_{i=n_0+1}^{N_2} X_i\right).$$

We know that $\sum_{i=1}^{n_0} X_i$ and $\{X_{n_0+i}\}_{i=1}^\infty$ are independent of $S_{n_0}^2$. Since
$N_2$ is constant conditional on $S_{n_0}^2$, we have that the conditional distribution
of $\bar{X}_{N_2}$ given $S_{n_0}^2$ is $N(\mu, \sigma^2/N_2)$. So, the conditional distribution of
$\sqrt{N_2}\,(\bar{X}_{N_2} - \mu)/\sigma$ is $N(0, 1)$, which is independent of $S_{n_0}^2$. It follows that
(9.50) has $t_{n_0-1}(0, 1)$ distribution. □
Technically, we could use the interval with half-width

$$\frac{S_{n_0}}{\sqrt{N_2}}\, T_{n_0-1}^{-1}\left(1 - \frac{\alpha}{2}\right),$$

but this should be pretty close to $d$.

The problems with this procedure derive from the first stage of sampling.
First, there is the question of how to choose $n_0$, which turns out to be
crucial to the performance of the method as we will see in Example 9.52.
Second, there is the fact that the estimate of $\Sigma^2$ is based only on the first
$n_0$ observations. In order to get a good estimate, $n_0$ must be large, but
then we might sample too much if $\Sigma^2$ is small. If we choose $n_0$ too small,
then $c$ is small and $N_2$ is large.






Table 9.51. Classical Fixed-Width Confidence Interval Sample Sizes for Example 9.52

  $n_0$   $S_{n_0}^2$   $c$       $N_2$
  2       3.222653      0.0238    136
  3       6.707906      0.2075     33
  4       6.616990      0.3793     18
  5       5.885602      0.4983     12
  6       6.462292      0.5814     12
  7       5.625235      0.6416      9
  8       5.809566      0.6870      9
  9       5.561759      0.7224      9



Example 9.52 (Continuation of Example 9.45; see page 558). Suppose that we
desire a classical sequential confidence interval with half-width 1.96 and coefficient
0.95. We will use the same sequence of data as we had earlier. To correspond more
closely with a classical analysis, we should change to an improper prior. In this
case, the interval based on the first nine observations is the first one to have
posterior probability greater than 0.95 and so this would be the naive sequential
interval. For various values of $n_0$, we can implement the classical procedure; the
results are summarized in Table 9.51. If $n_0 = 7$, 8, or 9, we will get essentially
the same result as the naive interval.
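The two-stage sample sizes can be recomputed from the first-stage variances listed in Table 9.51. The sketch below is an illustration only, not part of the original text. It assumes the definitions $c = d^2/[T_{n_0-1}^{-1}(1-\alpha/2)]^2$ and $N_2 = \max\{n_0, \lfloor S_{n_0}^2/c \rfloor + 1\}$, and it obtains the $t$ quantile by numerically integrating the $t$ density and bisecting, which is a convenience chosen here to stay within the Python standard library. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the standard t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0 via composite Simpson integration of the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for p > 1/2, found by bisection on the numerical CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:      # expand the bracket until it contains the quantile
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def two_stage(n0, s2, d, alpha):
    """Return (c, N2) for first-stage size n0 and first-stage variance s2."""
    q = t_quantile(1 - alpha / 2, n0 - 1)
    c = d * d / (q * q)
    n2 = max(n0, math.floor(s2 / c) + 1)
    return c, n2

# First-stage sample variances S^2_{n0} as listed in Table 9.51.
first_stage_variances = {2: 3.222653, 3: 6.707906, 4: 6.616990, 5: 5.885602,
                         6: 6.462292, 7: 5.625235, 8: 5.809566, 9: 5.561759}
table = {n0: two_stage(n0, s2, d=1.96, alpha=0.05)
         for n0, s2 in first_stage_variances.items()}
for n0, (c, n2) in table.items():
    print(f"n0 = {n0}: c = {c:.4f}, N2 = {n2}")
```

With a statistics library available, the quantile step could be replaced by a library call; the hand-rolled version is used here only so that the example has no dependencies.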



9.4 The Relevance of Stopping Rules 

In Section 9.1, we introduced sequential decision problems and examined 
rules that decide optimally after stopping. In Problem 2 on page 567, the 
reader is asked to prove that the formal Bayes rule in a sequential decision 
problem decides optimally after stopping. What exactly is the meaning of 
deciding optimally after stopping? In words, it means that after the stop- 
ping rule says to take no more observations, we then behave exactly as we 
would if we had observed whatever data we now have under a fixed sample 
size scheme. This is in stark contrast to classical sequential procedures like 
the level $\alpha$ SPRT. When the SPRT stops at $N = n$, the terminal decision is
most definitely not the same as that of a level $\alpha$ test based on a fixed-sized
sample of size n. On the other hand, when the SPRT is viewed as a formal 
Bayes rule, it is true that the terminal decision is exactly the same as what 
the formal Bayes rule would be after observing a fixed-sized sample of size 
n. 

In light of Problem 2 on page 567, it makes perfect sense for the Bayesian 
statistician to make the same decision after stopping as he or she would have 
made had the data arrived in a nonsequential fashion. Why, then, does the 
classical statistician behave differently in the two situations? The easiest 
way to answer this question is to see what would happen if the classical 
statistician tried to use a fixed sample size terminal decision after stopping. 



9.4. The Relevance of Stopping Rules 



563 



For a simple example, suppose that $\{X_n\}_{n=1}^\infty$ are IID $N(\theta, 1)$ given $\Theta = \theta$.
We will consider the problem of testing the hypothesis $H : \Theta = \theta_0$ versus
$A : \Theta \ne \theta_0$. Let $\bar{X}_n$ be the average of the first $n$ of the $X_i$. Given $\Theta = \theta_0$,
$\sqrt{n}(\bar{X}_n - \theta_0)$ has $N(0, 1)$ distribution for every $n$. Hence, for every $c$,

$$P'_{\theta_0}\left(\sqrt{n}(\bar{X}_n - \theta_0) > c\right) = 1 - \Phi(c) > 0.$$

It follows that

$$P'_{\theta_0}\left(\limsup_{n \to \infty} \sqrt{n}(\bar{X}_n - \theta_0) > c\right) > 0.$$

However, the event $\{\limsup_{n \to \infty} \sqrt{n}(\bar{X}_n - \theta_0) > c\}$ is in the tail $\sigma$-field $\mathcal{C}$,
so the Kolmogorov zero-one Law B.68 says

$$P'_{\theta_0}\left(\limsup_{n \to \infty} \sqrt{n}(\bar{X}_n - \theta_0) > c\right) = 1.$$

Similarly, for every $c$,

$$P'_{\theta_0}\left(\liminf_{n \to \infty} \sqrt{n}(\bar{X}_n - \theta_0) < c\right) = 1.$$

Hence, given $\Theta = \theta_0$, for every $c$, the probability is 1 that there will exist
$n$ such that $|\sqrt{n}(\bar{X}_n - \theta_0)| > c$. Let

$$N = \inf\{n : |\sqrt{n}(\bar{X}_n - \theta_0)| > c\},$$

where $c$ is the $1 - \alpha/2$ quantile of the standard normal distribution. It
follows that $N$ is a stopping time. Suppose that a classical statistician were
to use this stopping time. Suppose that after observing $N = n$, he or she
were to use a terminal decision that was the usual level $\alpha$ test of $H : \Theta = \theta_0$
versus $A : \Theta \ne \theta_0$ based on a sample of size $n$. This test would be to reject
$H$ if $|\sqrt{n}(\bar{X}_n - \theta_0)| > c$. This person would reject the hypothesis with
probability 1 given $\Theta = \theta_0$. This would not be a level $\alpha$ sequential test.
Clearly, from a classical viewpoint, the terminal decision rule has to depend
on which stopping time was used and not just the observed data.
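A small simulation makes the effect visible. The sketch below is not from the original text: the seed, the number of replications, and the truncation point `max_n` are arbitrary illustrative choices, and the function name is hypothetical. It draws data with the hypothesis true, stops the first time $|\sqrt{n}(\bar{X}_n - \theta_0)|$ exceeds the nominal two-sided 5% critical value, and records how often that happens within `max_n` observations. Since the argument above shows that the crossing occurs with probability 1 when sampling is unlimited, truncation can only understate the rejection rate.

```python
import math
import random

def first_crossing(c, max_n, rng, theta0=0.0):
    """Return the first n <= max_n with |sqrt(n) * (mean - theta0)| > c, else None.

    Data are drawn as IID N(theta0, 1), so the hypothesis is true.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(theta0, 1.0)
        z = math.sqrt(n) * (total / n - theta0)
        if abs(z) > c:
            return n
    return None

rng = random.Random(12345)          # fixed seed so the run is repeatable
c = 1.959964                        # 0.975 quantile of N(0, 1), i.e. alpha = 0.05
reps, max_n = 400, 1000             # illustrative choices, not from the text
stops = [first_crossing(c, max_n, rng) for _ in range(reps)]
rejection_rate = sum(s is not None for s in stops) / reps
print(f"rejected H in {rejection_rate:.0%} of runs truncated at n = {max_n}")
```

Raising `max_n` pushes the observed rejection rate toward 1, far above the nominal level of the fixed-sample-size test.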

The phenomenon just illustrated is often called "sampling to a foregone 
conclusion." That is, the stopping time is designed so that, if a fixed sample 
size terminal decision is used upon stopping, the conclusion to be drawn 
is determined in advance. A good deal of discussion of this phenomenon 
exists in the literature. It is particularly pertinent to clinical trials in which 
researchers would like to stop before the original study plan is finished if 
the results seem overwhelming. [See, for example, Cornfield (1966).] The 
concern is raised that this might allow unscrupulous researchers to sample 
to a foregone conclusion. The methods of sequential analysis are designed 
to prevent that, at least in the classical setting, by making the terminal 
decision rule depend on which stopping time is used. From a Bayesian point






of view, so long as the stopping time is a function of the observed data, and 
one conditions on the observed data, the terminal decision rule should be 
whatever would be optimal if that data were observed from a fixed sample 
size rule. If this is so, can a Bayesian be tricked into sampling to a foregone 
conclusion? The answer is no, if the Bayesian uses a proper prior.$^8$

To see why a Bayesian cannot sample to a foregone conclusion,$^9$ suppose
that $N$ is a strictly positive,$^{10}$ integer-valued random variable that might
equal $\infty$. Suppose also that $N$ is a function of observable data $\{X_n\}_{n=1}^\infty$
in such a way that, for every finite $n$, $I_{\{n\}}(N)$ is a function of $X_1, \ldots, X_n$.
Let

$$B_n = \{(x_1, \ldots, x_n) : N = n\}.$$

Let $Z$ be a random variable of interest (perhaps the indicator of some subset
of the parameter space or anything else) with finite mean $c$. Suppose that,
for every $n$ and every $(x_1, \ldots, x_n) \in B_n$, $E(Z|X_1 = x_1, \ldots, X_n = x_n) >
d > c$. (A similar argument works for $d < c$.) It follows from the law of
total probability B.70 that

$$E(Z) = \sum_{n=1}^\infty E(E(Z|X_1, \ldots, X_n)|N = n) \Pr(N = n) + E(E(Z|X_1, X_2, \ldots)|N = \infty) \Pr(N = \infty). \tag{9.53}$$

If we suppose that $\Pr(N = \infty) = 0$, then the right-hand side of (9.53) is
greater than $d$ and the left-hand side equals $c < d$, which is a contradiction.
Hence, $\Pr(N = \infty) > 0$. This means that, if a Bayesian does not believe
a priori the conclusion (in this case that the mean of $Z$ is greater than $c$),
then it cannot be guaranteed that he or she will believe it after sequential
sampling.

A more useful result is available if $Z = I_B(Y)$ for some random variable
$Y$. In this case, we can calculate a bound on the conditional probability of
stopping given $Y \notin B$. First, note that $E(Z) = \Pr(Y \in B) = c$, and rewrite
(9.53) as

$$\Pr(Y \in B) \ge \sum_{n=1}^\infty E(\Pr(Y \in B|X_1, \ldots, X_n)|N = n) \Pr(N = n).$$

The right-hand side of this is greater than $d \Pr(N < \infty)$. It follows that



8 The reason that a proper prior is needed is subtle. The argument given below 
depends on the law of total probability B.70. Kadane, Schervish, and Seidenfeld 
(1996) show that when improper priors are viewed as finitely additive proba- 
bilities, sampling to a foregone conclusion is possible because finitely additive 
probabilities do not satisfy the law of total probability. 

9 This argument is like one given by Kerridge (1963). 

10 We saw in Example 9.3 on page 537 that if $\Pr(N = 0) > 0$, then $\Pr(N = 0) = 1$.






$\Pr(N < \infty) < c/d$. Now, write

$$\Pr(N < \infty | Y \notin B) = \frac{\Pr(Y \notin B | N < \infty) \Pr(N < \infty)}{\Pr(Y \notin B)}.$$

By design, $\Pr(Y \notin B | N < \infty) < 1 - d$ and $\Pr(Y \notin B) = 1 - c$. Combining
these results gives

$$\Pr(N < \infty | Y \notin B) < \frac{c(1 - d)}{d(1 - c)}. \tag{9.54}$$

The claim of (9.54) was made without proof by Savage (1962) with reference
to a specific example. Cornfield (1966) proves the result in another specific
example. A similar claim, without an explicit bound, was made by Good
(1956).

It should be noted that the proof that a Bayesian cannot sample to a
foregone conclusion involves probabilities calculated under the joint distribution
of all random quantities. For example, if $\Theta$ has a continuous
distribution and $Z = I_B(\Theta)$ for some set $B$, it might be the case that for
some $\theta \notin B$, $P'_\theta(N < \infty) = 1$ (see Problem 12 on page 569, for example),
but the set of all such $\theta$ must have small prior probability.

This discussion is not meant to say that Bayesians can ignore all stopping
rules. All it says is that so long as the stopping rule is a function of
the observed data$^{11}$ and the Bayesian conditions on the observed data, no
further account need be taken of the stopping rule. An example is given
by Berger and Berry (1988) of a situation in which one or the other of the
two criteria is not met. We give a modified version here. Other examples
are given by Roberts (1967).

11 We mean that, for each $n$, the event $\{N = n\}$ must be measurable with
respect to the $\sigma$-field generated by $X_1, \ldots, X_n$. In a trivial sense, the stopping
time is always a function of the observed data if the observed data are defined
to include all and only those observations with subscripts up to and including
$N$. If you know that the observed data are $X_1, \ldots, X_n$, then you know that
$N = n$. What is required is that, for each $n$, if you are merely told the values of
$X_1, \ldots, X_n$ (but not $N$), you would be able to figure out whether or not $N > n$.

Example 9.55. Suppose that $\{X_n\}_{n=1}^\infty$ are IID $Ber(\theta)$ conditional on $\Theta = \theta$
and $\Theta \in \{0.49, 0.51\}$. The observations cannot be taken at will; rather they arrive
according to a Poisson process with rate $\lambda(\theta)$ conditional on $\Theta = \theta$, but the $X_i$
are independent of the arrival times conditional on $\Theta$. Suppose that $\lambda(0.49)$ is
one observation per second and $\lambda(0.51)$ is one observation per hour. The stopping
rule will be to sample for one minute and then stop when the next observation
arrives.

Suppose that we observe $N = 60$. If we accidentally consider $X_1, \ldots, X_{60}$ to be
the observed data and condition on it alone, we will be ignoring valuable information.
Specifically, the time it took to observe the 60 values contains valuable
information about $\Theta$. Furthermore, the very fact that 60 observations were observed
contains information about $\Theta$, even if we don't know how long it took to
get them. Also, the criterion that the stopping rule be a function of the observed
data would not be met in this case, since one cannot tell from looking at the first
60 $X_i$ values that the experiment would stop at $N = 60$. One also needs to look
at the clock (and possibly the calendar).

This is not to say that one could not make inference based solely on $X =
(X_1, \ldots, X_N)$ in this case. To obtain the density of $X$ given $\Theta$, we first introduce
$\{Y_n\}_{n=1}^\infty$, the interarrival times of the Poisson process. Now, $\{N = n\}$ is a function
of $Y_1, \ldots, Y_n$, and we can write (assuming that $\lambda(\theta)$ has units of "observations
per second") the conditional joint density of $X$ and $Y = (Y_1, \ldots, Y_N)$ given $\Theta = \theta$
as

$$f_{X,Y|\Theta}(x, y|\theta) = \theta^k (1 - \theta)^{n-k} \lambda(\theta)^n \exp(-\lambda(\theta) t),$$

where $t = \sum_{i=1}^n y_i$, $n$ is the observed value of $N$, and $k = \sum_{i=1}^n x_i$. We can integrate
$Y$ out of this to obtain the conditional density of $X$ given $\Theta$.$^{12}$ To integrate
out $Y$ for fixed $n$, transform from $y_1, \ldots, y_n$ to $t, y_{n-1}, \ldots, y_1$. The Jacobian is 1,
and the ranges of integration are (with $t$ innermost and $y_1$ outermost)

$$t > 60,$$
$$0 < y_i < 60 - \sum_{j=1}^{i-1} y_j, \quad \text{for } i = n - 1, \ldots, 2,$$
$$0 < y_1 < 60.$$

The result of integrating out these variables is

$$\frac{(60\lambda(\theta))^{n-1} \exp(-60\lambda(\theta))}{(n - 1)!}.$$

So the likelihood is

$$f_{X|\Theta}(x|\theta) = \theta^k (1 - \theta)^{n-k}\, \frac{(60\lambda(\theta))^{n-1} \exp(-60\lambda(\theta))}{(n - 1)!}. \tag{9.56}$$

For example, suppose that the prior for $\Theta$ puts probability $q$ on $\Theta = 0.51$.
Then the posterior probabilities of the two values of $\Theta$ are in the ratio:

$$\frac{f_{\Theta|X}(0.51|x)}{f_{\Theta|X}(0.49|x)} = \frac{q}{1 - q}\, 2.778 \times 10^{-4} \exp(59.983)\, (1.0408)^{2k}\, (2.6689 \times 10^{-4})^n.$$

If, for example, $n = 60$ and $k = 30$ are the only observed data and $q$ is not
essentially 0 or 1, then the posterior probability of $\Theta = 0.49$ will be essentially


12 An alternative method for calculating the conditional density of $X$ given $\Theta$ is
the following. The conditional joint density of the $X_i$ given $\Theta = \theta$ and $N = n$ is
still that of $n$ IID $Ber(\theta)$ random variables since $N$ is independent of the values
of the $X_i$ given $\Theta$. The conditional density of $N$ given $\Theta = \theta$ is

$$f_{N|\Theta}(n|\theta) = \frac{(60\lambda(\theta))^{n-1} \exp(-60\lambda(\theta))}{(n - 1)!}.$$

That is, $N$ is just one plus a $Poi(60\lambda(\theta))$ random variable. So the likelihood
function for observing $X$ alone would be as given in (9.56).



9.5. Problems 567 



1 because $N = 60$ is orders of magnitude more likely when the Poisson process
has rate 1 than when it has rate $2.778 \times 10^{-4} = \lambda(0.51)$. On the other hand, if
$t = 216{,}000$ (2.5 days) is also observed, then the posterior probability of $\Theta = 0.51$
is essentially 1.
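The two conclusions just stated can be reproduced by working on the log scale. The following sketch is illustrative and not part of the original text: it evaluates the likelihood (9.56) for $X$ alone and the joint density of $X$ and the interarrival times directly, rather than through the rounded constants in the displayed ratio. The prior probability `q = 0.5` is an arbitrary choice made here, and the function names are hypothetical.

```python
import math

RATE = {0.49: 1.0, 0.51: 1.0 / 3600.0}   # lambda(theta) in observations per second

def log_lik_x_only(theta, n, k):
    """Log of (9.56): Bernoulli part times the Poisson probability that N - 1 = n - 1."""
    lam = RATE[theta]
    return (k * math.log(theta) + (n - k) * math.log(1 - theta)
            + (n - 1) * math.log(60 * lam) - 60 * lam - math.lgamma(n))

def log_lik_x_and_t(theta, n, k, t):
    """Log of the joint density of X and Y, which depends on y only through t."""
    lam = RATE[theta]
    return (k * math.log(theta) + (n - k) * math.log(1 - theta)
            + n * math.log(lam) - lam * t)

def posterior_prob_051(log_lik, q, *args):
    """Posterior probability of theta = 0.51 when the prior puts probability q on it."""
    log_odds = math.log(q / (1 - q)) + log_lik(0.51, *args) - log_lik(0.49, *args)
    # Logistic transform, written so that exp() is only called on nonpositive values.
    if log_odds > 0:
        return 1 / (1 + math.exp(-log_odds))
    return math.exp(log_odds) / (1 + math.exp(log_odds))

q = 0.5                                   # illustrative prior probability of theta = 0.51
p_x_only = posterior_prob_051(log_lik_x_only, q, 60, 30)
p_with_t = posterior_prob_051(log_lik_x_and_t, q, 60, 30, 216000.0)
print(p_x_only, p_with_t)
```

With $n = 60$ and $k = 30$ the Bernoulli factors cancel, so everything is driven by the arrival process: observing only $N = 60$ points overwhelmingly to $\Theta = 0.49$, while also observing $t = 216{,}000$ seconds points overwhelmingly to $\Theta = 0.51$.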

A classical statistician could also make inference based solely on $X$ without
much trouble. First notice that

$$\log \frac{f_{X|\Theta}(x|0.51)}{f_{X|\Theta}(x|0.49)}$$

equals $2k \log(1.0408) + n \log(2.6689 \times 10^{-4})$
plus a constant. If a level $\alpha = 0.05$ test of $H : \Theta = 0.49$ versus $A : \Theta = 0.51$ is
desired, we need to add up the probabilities (given $\Theta = 0.49$) of all $k$ and $n$ values
with small $n$ and large $k$ until the sum exceeds 0.05. This happens at $n = 48$ and
$k = 29$. So the MP level 0.05 test is

$$\phi(x) = \begin{cases} 1 & \text{if } n < 48 \text{ or if } n = 48 \text{ and } k > 29, \\ 0.3077 & \text{if } n = 48 \text{ and } k = 29, \\ 0 & \text{otherwise.} \end{cases}$$

If $n = 48$ and $k = 29$ are observed and neither prior probability is close to 0, then
the posterior probability of $\Theta = 0.51$ is essentially 0. In fact the Bayes factor
(likelihood ratio) drops below 1 when $N > 6$. So, the evidence is more in favor of
the hypothesis than the alternative whenever $N > 6$ is observed, yet the MP level
0.05 test continues to reject $H$ even when 47 observations have been observed.
The size of the test that rejects $H$ if and only if $N \le 6$ is $6.3 \times 10^{-19}$. The type
II error probability of this test is about the same.

What happened at the end of Example 9.55 is illustrative of the faulty 
reasoning that leads to the choice of a test by its level. The reasoning is 
that we wish to protect against the more costly error (type I error) so we 
make the probability of type I error small and then choose the test with 
the smallest type II error. What happened in Example 9.55 is that the data 
are so well able to distinguish the two hypotheses that making the type I 
error probability as large as 0.05 makes the type II error probability drop 
to 0. This is just the opposite of the effect that was desired.



9.5 Problems 

Section 9.1: 

1. Take the situation described in Example 9.3 on page 537 and find the 
optimal procedure that takes at most three observations. (It is not the 
truncation of the rule in Example 9.3 to three observations.) Comment on 
why it differs from the optimal rule in Example 9.3 even before the third 
observation. 

2. Prove that the formal Bayes rule in a sequential decision problem decides 
optimally after stopping almost surely. 

3. Refer to the setup in Definition 9.2. Let $\mathcal{C}$ be the collection of all $A \in \mathcal{B}^\infty$
such that, for every $n$, $A \cap \{N = n\} \in \mathcal{B}^n$. Let $Q$ be the probability on
(V, V) induced by V from $\mu$.






(a) Prove that $\mathcal{C}$ is a $\sigma$-field.

(b) For each n and each D € £>, prove that Q n {D\x) = Pr(V r_1 (D)| 
X~\B n )), a.s. 

(c) Prove that, for each D € X>, Qn(£>|z) = PrO^D^X" 1 ^)), a.s. 

4. Prove that a rule computed via backward induction is regular. 

5. Prove Proposition 9.8 on page 541. 

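6. *Suppose that $V > 0$ and $\lim_{n \to \infty} E\rho_0(Q(\cdot|X_1, \ldots, X_n)) = 0$. Let $\gamma_n$ be
defined by (9.28) where $0 \le \gamma_0 \le \rho^*$.

(a) Prove that $\gamma_n \le \rho^*$ for all $n$.

(b) Prove that

$$|\gamma_{n+m}(Q) - \gamma_n(Q)| \le E\left(|\gamma_m(Q(\cdot|X_1, \ldots, X_n)) - \gamma_0(Q(\cdot|X_1, \ldots, X_n))|\right).$$

(c) Prove that $\lim_{n \to \infty} \gamma_n(Q)$ converges to some quantity $\gamma^*(Q)$ that
satisfies

$$\gamma^*(Q) = \min\{\rho_0(Q), E(\gamma^*(Q(\cdot|X_1))) + c\}. \tag{9.57}$$

(d) Suppose that $\gamma_1^*$ and $\gamma_2^*$ both satisfy (9.57) for all probabilities $Q$.
Show that

$$|\gamma_1^*(Q) - \gamma_2^*(Q)| \le E\left(|\gamma_1^*(Q(\cdot|X_1, \ldots, X_n)) - \gamma_2^*(Q(\cdot|X_1, \ldots, X_n))|\right).$$

(e) Prove that $\gamma_1^* = \gamma_2^*$.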
Section 9.2: 

7. Stein (1946) proves that if $\Pr(f_1(X_i) \ne f_0(X_i)) > 0$, then there exist
$c$ and $\rho < 1$ such that $\Pr(N > n) \le c\rho^n$ for the SPRT. Define $Z_i =
\log[f_1(X_i)/f_0(X_i)]$.

(a) Prove that $\mathrm{Var}(Z_i) > 0$ implies $\Pr(f_1(X_i) \ne f_0(X_i)) > 0$.

(b) Find an example in which $\Pr(f_1(X_i) \ne f_0(X_i)) > 0$ but $\mathrm{Var}(Z_i) = 0$.

(c) Under the conditions of Theorem 9.33, prove that there is a subsequence
$\{n_k\}_{k=1}^\infty$ and a $c$ and $\rho < 1$ such that $\Pr(N > n_k) \le c\rho^{n_k}$.

8. Prove Proposition 9.40 on page 554. 

9. Suppose that {X„}~ 1 are IID Ber{6) given 0 = 0. Let N be the first n 
such that |n=i & ~ n / 2 l ^ 2 ' Prove that Ee{N) = 2/l ° + (1 ~ ° ) ] ' 

Section 9.3: 

10. Let $\{(\mathcal{X}_n, \mathcal{B}_n)\}_{n=1}^\infty$ be a sequence of sample spaces, and let $f_n, g_n$ be two
densities on $(\mathcal{X}_n, \mathcal{B}_n)$ for every $n$. Let $X_n : S \to \mathcal{X}_n$ be a random quantity
for every $n$. Define $Z_n = g_n(X_n)/f_n(X_n)$. Let $P$ be the probability that
says that $X_n$ has density $f_n$ for every $n$. Let $k > 0$ and $N = \inf\{n : Z_n >
k\}$. Prove that $P(N < \infty) \le 1/k$.






11.* An alternative to fixed-width confidence intervals is to form a confidence
sequence. A coefficient $\gamma$ confidence sequence for $\Theta$ is a sequence of sets
$\{R_n\}_{n=1}^\infty$ such that $P_\theta(\theta \in R_n, \text{ for all } n) \ge \gamma$ for all $\theta$. Let $\{X_n\}_{n=1}^\infty$ be
conditionally IID with $N(\theta, 1)$ distribution given $\Theta = \theta$. Use the result
from Problem 10 above to find a coefficient $\gamma$ confidence sequence for $\Theta$.
(Hint: Let $X_n$ in Problem 10 be $(X_1, \ldots, X_n)$ in this problem. Let $P$ be $P_\theta$
and let $g_n$ be the prior predictive density of $(X_1, \ldots, X_n)$ under a suitable
prior for $\Theta$.)



Section 9.4:



12. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $N(\theta, 1)$ given $\Theta = \theta$ and that
$\Theta \sim N(0, 1)$. Show that, given $\Theta = 0$, for every $\alpha$ the probability is 1 that
there will exist $n$ such that $\Pr(\Theta < 0|X_1, \ldots, X_n) > \alpha$.

13. *Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $N(\theta, 1)$ given $\Theta = \theta$ and that
$\Theta \sim N(\theta_0, 1/\lambda)$. In this problem, we will prove that, given $\Theta = \theta > \theta_0$,
the probability is less than 1 that there will exist $n$ such that $\Pr(\Theta \le
\theta_0|X_1, \ldots, X_n) > \alpha > 1/2$, so long as $\Pr(\Theta \le \theta_0) < \alpha$. In what follows, let
$\theta > \theta_0$ and $\alpha > 1/2$.

(a) Define $Z_i = X_i - \theta_0$ and $\psi(h) = E_\theta \exp(hZ_i)$. Show that there exists
$h < 0$ such that $\psi(h) = 1$.

(b) Let $p = \Pr(\Theta \le \theta_0) < \alpha$, and let $S_n = \sum_{i=1}^n Z_i$. Prove that $\Pr(\Theta \le
\theta_0|X_1, \ldots, X_n) > \alpha$ is equivalent to $S_n < c_n$, where



(c) Prove that $c_n \le c_1 < 0$ for all $n$.

(d) Let $f_1$ be the $N(\theta - \theta_0, 1)$ density (the conditional density of $Z_i$
given $\Theta = \theta$), and let $f_0 = f_1$ times $\exp(hx)$, where $h$ was found
in part (a). Let $b > 0$, and consider the SPRT$(B, A)$ for testing the
hypothesis $\{f_0\}$ against the alternative $\{f_1\}$ with $A = \exp(-hb)$ and
$B = \exp(-hc_1)$. Let $M_b$ be the stopping time of this test. Use Theorem 9.35
to show that

Pe(SPRT(B,A) accepts hypothesis) < l^wiW 



exp(hci) — exp(hb) ' 

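(e) Define

$$N = \inf\{n : \Pr(\Theta \le \theta_0|X_1, \ldots, X_n) > \alpha\},$$
$$M = \inf\{n : S_n < c_1\}.$$

Show that $P_\theta(N < \infty) \le P_\theta(M < \infty)$.

(f) Prove that

$$P_\theta(M < \infty) = \lim_{b \to \infty} P_\theta(\text{SPRT}(B, A) \text{ accepts hypothesis}) < 1.$$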

b—+oc 



Appendix A 

Measure and Integration Theory 



This appendix contains an introduction to the theory of measure and integration. 
The first section is an overview. It could serve either as a refresher for those who 
have previously studied the material or as an informal introduction for those who 
have never studied it. 



A.1 Overview
A.1.1 Definitions

In many introductory statistics and probability courses, one encounters discrete 
and continuous random variables and vectors. These are all special cases of a 
more general type of random quantity that we will study in this text. Before we 
can introduce the more general type of random quantity, we need to generalize 
the sums and integrals that figure so prominently in the distributions of discrete 
and continuous random variables and vectors. The generalization is through the 
concept of a measure (to be defined shortly), which is a way of assigning numerical 
values to the "sizes" of sets. 

Example A.1. Let $S$ be a nonempty set, and let $A \subseteq S$. Define $\mu(A)$ to be
the number of elements of $A$. Then $\mu(S) > 0$, $\mu(\emptyset) = 0$, and if $A_1 \cap A_2 = \emptyset$,
$\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$. Note that $\mu(A) = \infty$ is possible if $S$ has infinitely
many elements. The measure $\mu$ described here is called counting measure on $S$.

Example A.2. Let $A$ be an interval of real numbers. If $A$ is bounded, let $\mu(A)$
be the length of $A$. If $A$ is unbounded, let $\mu(A) = \infty$. It is easy to see that
$\mu(\mathbb{R}) = \infty$,$^1$ $\mu(\emptyset) = 0$, and if $A_1 \cap A_2 = \emptyset$ and $A_1 \cup A_2$ is an interval, then

1 By $\mathbb{R}$, we mean the set of real numbers.



A.l. Overview 571 



$\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$. The measure $\mu$ described here is called Lebesgue
measure.

Example A.3. Let $f : \mathbb{R} \to \mathbb{R}^+$ be a continuous function.$^2$ Define, for each
interval $A$, $\mu(A) = \int_A f(x)\,dx$. Then $\mu(\mathbb{R}) > 0$, $\mu(\emptyset) = 0$, and if $A_1 \cap A_2 = \emptyset$ and
$A_1 \cup A_2$ is an interval, then $\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$.

Since measure will be used to give sizes to sets, the domain of a measure will 
be a collection of sets. In general, we cannot assign sizes to all sets, but we need 
enough sets so that we can take unions and complements. A collection of sets 
that is closed under taking complements and finite unions is called a field. A field 
that is closed under taking countable unions is called a $\sigma$-field.

Example A.4. Let $S$ be any set. Let $\mathcal{A} = \{S, \emptyset\}$. This $\sigma$-field is called the trivial
$\sigma$-field. As a second example, let $A \subseteq S$. Let $\mathcal{A} = \{S, A, A^C, \emptyset\}$. Let $B$ be another
subset of $S$, and let $\mathcal{A} = \{S, A, B, A^C, B^C, A \cap B, A \cap B^C, \ldots\}$. Such examples
grow rapidly. The largest $\sigma$-field is the collection of all subsets of $S$, called the
power set of $S$ and denoted $2^S$.

Example A.5. One field of subsets of $\mathbb{R}$ is the collection of all unions of finitely
many disjoint intervals (unbounded intervals are allowed). This collection is not
a $\sigma$-field, however.

It is easy to prove that the intersection of an arbitrary collection of $\sigma$-fields
is itself a $\sigma$-field. Since $2^S$ is a $\sigma$-field, it is easy to see that, for every collection
of subsets $\mathcal{C}$ of $S$, there is a smallest $\sigma$-field $\mathcal{A}$ that contains $\mathcal{C}$, namely the
intersection of all $\sigma$-fields that contain $\mathcal{C}$. This smallest $\sigma$-field is called the
$\sigma$-field generated by $\mathcal{C}$.
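For a finite set $S$ the generated $\sigma$-field can be computed directly, since closure under complements and finite unions is then all that is required. The sketch below is an illustration that is not part of the original text, and the function name is hypothetical. It builds the family by repeatedly adding complements and pairwise unions until nothing new appears, rather than by intersecting all $\sigma$-fields that contain the collection.

```python
from itertools import combinations

def generated_sigma_field(space, collection):
    """Smallest family of subsets of a finite `space` that contains `collection`
    and is closed under complements and (finite) unions."""
    space = frozenset(space)
    family = {frozenset(), space} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:                       # close under complements
            if space - a not in family:
                family.add(space - a)
                changed = True
        for a, b in combinations(current, 2):   # close under pairwise unions
            if a | b not in family:
                family.add(a | b)
                changed = True
    return family

S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}
field_A = generated_sigma_field(S, [A])        # {S, A, complement of A, empty set}
field_AB = generated_sigma_field(S, [A, B])
print(len(field_A), len(field_AB))
```

With these particular sets, $A$ and $B$ split $S$ into four nonempty pieces, so the family generated by both contains every union of those pieces. This is the rapid growth mentioned in Example A.4.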

The most commonly used $\sigma$-field in this book will be the one generated by the
collection $\mathcal{C}$ of open subsets of a topological space.$^3$ This $\sigma$-field is called the Borel
$\sigma$-field. It is easy to see that the Borel $\sigma$-field $\mathcal{B}^1$ for $\mathbb{R}$ is the $\sigma$-field generated
by the intervals of the form $[b, \infty)$. It is also the $\sigma$-field generated by the intervals
of the form $(-\infty, a]$ and the $\sigma$-field generated by the intervals of the form $(a, b)$.
Since multidimensional Euclidean spaces are topological spaces, they also have
Borel $\sigma$-fields.

An alternative way to generate the Borel $\sigma$-fields of $\mathbb{R}^k$ spaces is by means of
product spaces. The $\sigma$-field generated by all product sets (one factor from each
$\sigma$-field) in a product space is called the product $\sigma$-field. In $\mathbb{R}^k$, the product
$\sigma$-field of one-dimensional Borel sets $\mathcal{B}^1$ is the same as the Borel $\sigma$-field $\mathcal{B}^k$ in the
$k$-dimensional space (Proposition A.38).

Sometimes, we need to extend $\mathbb{R}$ to include points at infinity. The extended real
numbers are the points in $\mathbb{R} \cup \{\infty, -\infty\}$. The Borel $\sigma$-field $\mathcal{B}^+$ of the extended real
numbers consists of $\mathcal{B}^1$ together with all sets of the form $B \cup \{\infty\}$, $B \cup \{-\infty\}$, and
$B \cup \{\infty, -\infty\}$ for $B \in \mathcal{B}^1$. It is easy to check that $\mathcal{B}^+$ is a $\sigma$-field. (See Problem 4
on page 603.)



2 By $\mathbb{R}^+$, we mean the open interval $(0, \infty)$.

3 A space $X$ is a topological space if it has a collection $\mathcal{V}$ of subsets called a
topology which satisfies the following conditions: $\emptyset \in \mathcal{V}$, $X \in \mathcal{V}$, the intersection
of finitely many elements of $\mathcal{V}$ is in $\mathcal{V}$, and the union of arbitrarily many elements
of $\mathcal{V}$ is in $\mathcal{V}$. The sets in $\mathcal{V}$ are called open sets.



572 Appendix A. Measure and Integration Theory 



If $\mathcal{A}$ is a $\sigma$-field of subsets of a set $S$, then a measure $\mu$ on $S$ is a function from
$\mathcal{A}$ to the nonnegative extended real numbers that satisfies

• $\mu(\emptyset) = 0$,

• $\{A_n\}_{n=1}^\infty$ mutually disjoint implies $\mu\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mu(A_i)$.

If $\mu$ is a measure, the triple $(S, \mathcal{A}, \mu)$ is called a measure space. If $(S, \mathcal{A}, \mu)$ is a
measure space and $\mu(S) = 1$, then $\mu$ is called a probability and $(S, \mathcal{A}, \mu)$ is called
a probability space.

Some examples of measures were given earlier. The Caratheodory extension
theorem A.22 shows how to construct measures by first defining countably additive
set functions on fields and then extending them to the generated $\sigma$-field.
Lebesgue measure is defined in this manner by starting with length for unions of
disjoint intervals.

Sets with measure zero are ubiquitous in measure theory, so there is a special
term that allows us to refer to them more easily. If $E$ is some statement concerning
the points in $S$, and $\mu$ is a measure on $S$, we say that $E$ is true almost everywhere
with respect to $\mu$, written a.e. $[\mu]$, if the set of $s$ such that $E$ is not true is contained
in a set $A$ with $\mu(A) = 0$. If $\mu$ is a probability, then almost everywhere is often
expressed as almost surely and denoted a.s. $[\mu]$.

Example A.6. It is well known that a nondecreasing function can have at most
a countable number of discontinuities. Since countable sets have Lebesgue measure
(length) 0, it follows that nondecreasing functions are continuous almost
everywhere with respect to Lebesgue measure.

Infinite measures are difficult to deal with unless they behave like finite measures
in certain important ways. If there exists a countable partition of the set $S$
such that each element of the partition has finite $\mu$ measure, then we say that $\mu$
is $\sigma$-finite. When an abstract measure is mentioned in this text, it will generally
be safe to assume that it is $\sigma$-finite unless the contrary is clear from context.

A.1.2 Measurable Functions

There are certain types of functions with which we will be primarily concerned.
Suppose that $S$ is a set with a $\sigma$-field $\mathcal{A}$ of subsets, and let $T$ be another set
with a $\sigma$-field $\mathcal{C}$ of subsets. Suppose that $f : S \to T$ is a function. We say $f$
is measurable if for every $B \in \mathcal{C}$, $f^{-1}(B) \in \mathcal{A}$. When there are several possible
$\sigma$-fields of subsets of either $S$ or $T$, we will need to say explicitly with respect to
which $\sigma$-field $f$ is measurable. If $f$ is measurable, one-to-one, and onto and $f^{-1}$ is
measurable, we say that $f$ is bimeasurable. If the two sets $S$ and $T$ are topological
spaces with Borel $\sigma$-fields, a measurable function is Borel measurable.

As examples, all continuous functions are Borel measurable. But many discontinuous
functions are also measurable. For example, step functions are measurable.
All monotone functions are measurable. In fact, it is very difficult to
describe a nonmeasurable function without using some heavy mathematics.

If $S$ and $T$ are sets, $\mathcal{C}$ is a $\sigma$-field of subsets of $T$, and $f : S \to T$ is a function,
then it is easy to show that $f^{-1}(\mathcal{C})$ is a $\sigma$-field of subsets of $S$. In fact, it is the
smallest $\sigma$-field of subsets of $S$ such that $f$ is measurable, and it is called the
$\sigma$-field generated by $f$.






Some useful properties of measurable functions are in Theorem A.38. To summarize,
multivariate functions with measurable coordinates are measurable; compositions
of measurable functions are measurable; sums, products, and ratios of
measurable functions are measurable; limits, suprema, and infima of sequences
of measurable functions are measurable.

As an application of the preceding results, we have Theorem A.42, which says
that one function $g$ is a function of another $f$ if and only if $g$ is measurable with
respect to the $\sigma$-field generated by $f$.

Many theorems about measurable functions are proven first for a special class of
measurable functions called simple functions and then extended to all measurable
functions using some limit theorems. A measurable function $f$ is called simple
if it assumes only finitely many distinct values. The most fundamental limit
theorem is Theorem A.41, which says that every nonnegative measurable function
can be approached from below (pointwise) by a sequence of nonnegative simple
functions.
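One standard way to build such a sequence is the dyadic construction $f_n(s) = \min\{n, \lfloor 2^n f(s) \rfloor / 2^n\}$. The sketch below illustrates that construction; it is an assumption made here that this is a suitable example, and it is not claimed to be the construction used in the proof of Theorem A.41. The example function and evaluation points are arbitrary, and the names are hypothetical.

```python
import math

def simple_approx(f, n):
    """Return the n-th dyadic simple function lying below the nonnegative function f.

    The result takes values in {j / 2**n : j = 0, ..., n * 2**n}, a finite set.
    """
    def f_n(s):
        value = f(s)
        if value >= n:
            return float(n)                       # cap at n
        return math.floor(value * 2 ** n) / 2 ** n
    return f_n

def f(s):
    return s * s                                  # an arbitrary nonnegative example

points = [0.0, 0.3, 1.0, 1.7, 2.5, 10.0]
# levels[n - 1][j] is f_n evaluated at points[j], for n = 1, ..., 7.
levels = [[simple_approx(f, n)(s) for s in points] for n in range(1, 8)]
for n, row in enumerate(levels, start=1):
    print(n, row)
```

At each point the printed values are nondecreasing in $n$ and never exceed $f(s)$, and once $n$ exceeds $f(s)$ the gap is smaller than $2^{-n}$.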



A. 1.3 Integration 

The integral of a function with respect to a measure is a way to generalize the
Riemann integral. Interested readers should be able to convince themselves
that the integral as defined here is an extension of the Riemann integral. That
is, if the Riemann integral of a function over a closed and bounded interval
exists, then so does the integral as defined here, and the two are equal. We
define the integral in stages. We start with nonnegative simple functions. If $f$ is
a nonnegative simple function represented as $f(s) = \sum_{i=1}^n a_i I_{A_i}(s)$, with the $a_i$
distinct and the $A_i$ mutually disjoint, then the integral of $f$ with respect to $\mu$ is
$\int f(s)\,d\mu(s) = \sum_{i=1}^n a_i \mu(A_i)$. If $0$ times $\infty$ occurs in such a sum, the result is $0$
by convention. The integral of a nonnegative simple function is allowed to be $\infty$.
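The definition for nonnegative simple functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `integrate_simple` and the representation of $f$ as a list of pairs $(a_i, \mu(A_i))$ are hypothetical choices, not notation from the text.

```python
import math

def integrate_simple(terms):
    """Integrate a nonnegative simple function f = sum_i a_i * I_{A_i}.

    `terms` is a list of (a_i, mu_A_i) pairs: the value a_i and the measure
    mu(A_i) of the set on which f takes that value. Either entry may be
    math.inf. The convention 0 * inf = 0 is applied explicitly, because
    floating-point arithmetic would give nan for 0 * inf.
    """
    total = 0.0
    for a, mu_a in terms:
        if a < 0 or mu_a < 0:
            raise ValueError("values and measures must be nonnegative")
        if a == 0 or mu_a == 0:
            continue  # 0 times anything (including infinity) contributes 0
        total += a * mu_a
    return total

# f = 2 on [0, 1), 5 on [1, 3), and 0 elsewhere on the real line,
# integrated with respect to Lebesgue measure.
example = [(2.0, 1.0), (5.0, 2.0), (0.0, math.inf)]
result = integrate_simple(example)  # 2*1 + 5*2 + 0 = 12.0
```

The explicit check for a zero factor is the point of the sketch: it is where the convention "$0$ times $\infty$ is $0$" enters, and it is also why the integral of a simple function can be $\infty$ only when a strictly positive value is taken on a set of infinite measure.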

For general nonnegative measurable functions, we define the integral of $f$ with
respect to $\mu$ as $\int f(s)\,d\mu(s) = \sup_{g \le f,\ g\ \text{simple}} \int g(s)\,d\mu(s)$. For general functions $f$,
let $f^+(s) = \max\{f(s), 0\}$ and $f^-(s) = -\min\{f(s), 0\}$ (the positive and negative
parts of $f$, respectively). Then $f(s) = f^+(s) - f^-(s)$. The integral of $f$ with
respect to $\mu$ is

$$\int f(s)\,d\mu(s) = \int f^+(s)\,d\mu(s) - \int f^-(s)\,d\mu(s),$$

if at least one of the two integrals on the right is finite. If both are infinite, the
integral is undefined. We say that $f$ is integrable if the integral of $f$ is defined and
is finite. The integral is defined above in terms of its values at all points in $S$.
Sometimes we wish to consider only a subset $A \subseteq S$. The integral of $f$ over $A$
with respect to $\mu$ is

$$\int_A f(s)\,d\mu(s) = \int I_A(s) f(s)\,d\mu(s).$$

Several important properties of integrals will be needed in this text. Proposition
A.49 and Theorem A.53 state a few of the simpler ones, namely that functions
that are almost everywhere equal have the same integral, that the integral of a
linear combination of functions is the linear combination of the integrals, that
smaller functions have smaller integrals, and that two integrable functions that
have the same integral over every set are equal almost everywhere. Another useful
property, given in Theorem A.54, is that a nonnegative integrable function leads
to a new measure $\nu$ by means of the equation $\nu(A) = \int_A f(s)\,d\mu(s)$.

The most important theorems concern the interchange of limits with integration.
Let $\{f_n\}_{n=1}^\infty$ be a sequence of measurable functions such that $f_n(x) \to f(x)$
a.e. $[\mu]$. The monotone convergence theorem A.52 says that if the $f_n$ are
nonnegative and $f_n(x) \le f(x)$ a.e. $[\mu]$, then

$$\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x). \qquad \text{(A.7)}$$

The dominated convergence theorem A.57 says that if there exists an integrable
function $g$ such that $|f_n(x)| \le g(x)$, a.e. $[\mu]$, then (A.7) holds.
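A standard example, not taken from the text, may help show why some condition such as $f_n \le f$ or domination is needed for (A.7). The sequence below is a common textbook choice; Lebesgue measure on $(0,1)$ is assumed for $\mu$.

```latex
% Assumed setting: S = (0,1), \mu = Lebesgue measure.
\[
  f_n(x) = n\, I_{(0,1/n)}(x), \qquad n = 1, 2, \ldots
\]
% Pointwise limit: for each fixed x in (0,1), f_n(x) = 0 once n \ge 1/x.
\[
  \lim_{n\to\infty} f_n(x) = 0 = f(x) \quad \text{for all } x \in (0,1),
\]
% but every integral equals 1, so the limit cannot be moved inside:
\[
  \lim_{n\to\infty} \int f_n(x)\,d\mu(x) = 1 \ne 0 = \int f(x)\,d\mu(x).
\]
% Neither theorem applies: f_n \le f fails, and any g with |f_n| \le g for
% all n satisfies g(x) \ge \sup_n f_n(x) \ge 1/x - 1, which is not integrable.
```

The example is consistent with both theorems: it violates the hypothesis of each, and the conclusion (A.7) fails.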

Part 1 of Theorem A.38 says that measurable functions into each of two
measurable spaces combine into a jointly measurable function. Measures and
integration can also be extended from several spaces into the product space. For
example, suppose that $\mu_i$ is a measure on the space $(S_i, \mathcal{A}_i)$ for $i = 1, 2$. To
define a measure on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$, we can proceed as follows. For each product
set $A = A_1 \times A_2$, define $\mu_1 \times \mu_2(A) = \mu_1(A_1)\mu_2(A_2)$. The Carathéodory extension
theorem A.22 allows us to extend this definition to all of the product space.
Lebesgue measure on $\mathbb{R}^2$, denoted $dx\,dy$, is such a product measure. Not every
measure on a product space is a product measure. Product probability measures
will correspond to independent random variables.

Extending integration to product spaces proceeds through two famous theorems.
Tonelli's theorem A.69 says that a nonnegative function $f$ satisfies

$$\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int \left[\int f(x,y)\,d\mu_1(x)\right] d\mu_2(y) = \int \left[\int f(x,y)\,d\mu_2(y)\right] d\mu_1(x).$$

Fubini's theorem A.70 says that the same equations hold if $f$ is integrable with
respect to $\mu_1 \times \mu_2$. These results also extend to finite product spaces $S_1 \times \cdots \times S_n$.
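On finite spaces the integrals in Tonelli's theorem are finite weighted sums, so the three expressions can be compared directly. The following sketch assumes two hypothetical finite measures `mu1` and `mu2` (given by their point masses) and a hypothetical nonnegative function `f`; exact rational arithmetic is used so that equality can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical finite measures, specified by the mass of each point.
mu1 = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}  # on {0, 1, 2}
mu2 = {0: Fraction(2), 1: Fraction(5)}                           # on {0, 1}

def f(x, y):
    """A nonnegative function on the product space (illustrative choice)."""
    return Fraction(x * x + y + 1)

# Integral with respect to the product measure: the mass of the point (x, y)
# is mu1({x}) * mu2({y}).
product_integral = sum(f(x, y) * mu1[x] * mu2[y] for x in mu1 for y in mu2)

# Iterated integrals in the two possible orders.
x_then_y = sum(sum(f(x, y) * mu1[x] for x in mu1) * mu2[y] for y in mu2)
y_then_x = sum(sum(f(x, y) * mu2[y] for y in mu2) * mu1[x] for x in mu1)
```

All three quantities agree here. In the finite case this is just reordering a finite sum; the content of Tonelli's and Fubini's theorems is that the same interchange remains valid for general $\sigma$-finite measures under the stated conditions on $f$.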



A.1.4 Absolute Continuity

A special type of relationship between two measures on the same space is called
absolute continuity. If $\mu_1$ and $\mu_2$ are two measures on the same space, we say that
$\mu_2$ is absolutely continuous with respect to $\mu_1$, denoted $\mu_2 \ll \mu_1$, if $\mu_1(A) = 0$
implies $\mu_2(A) = 0$. When $\mu_2 \ll \mu_1$, we say that $\mu_1$ is a dominating measure for
$\mu_2$. Here are some examples:

Example A.8. 

• Let $f$ be any nonnegative measurable function and let $\mu_1$ be a measure.
Define $\mu_2(A) = \int_A f(s)\,d\mu_1(s)$. (See Theorem A.54.) Then $\mu_2 \ll \mu_1$.

• Let $S$ be the natural numbers and let $a_1, a_2, \ldots$ be any sequence of
nonnegative numbers. Define $\mu_1$ to be counting measure on $S$, and let
$\mu_2(A) = \sum_{i \in A} a_i$. Then $\mu_2 \ll \mu_1$.

• Let $\mu_1, \mu_2, \ldots$ be a collection of measures on the same space $(S, \mathcal{A})$. Let
$a_1, a_2, \ldots$ be a collection of positive numbers. Then $\mu = \sum_{i=1}^\infty a_i \mu_i$ is a
measure and $\mu_i \ll \mu$ for all $i$.

The last example above is important because it tells us that for every countable 
collection of measures, there is a single measure such that all measures in the 
collection are absolutely continuous with respect to it. 

The Radon-Nikodym theorem A.74 says that the first part of Example A.8 is
the most general form of absolute continuity with respect to $\sigma$-finite measures.
That is, if $\mu_1$ is $\sigma$-finite and $\mu_2 \ll \mu_1$, then there exists an extended real-valued
measurable function $f$ such that $\mu_2(A) = \int_A f(x)\,d\mu_1(x)$. In addition, if $g$ is
$\mu_2$ integrable, then $\int g(x)\,d\mu_2(x) = \int g(x) f(x)\,d\mu_1(x)$. The function $f$ is called
the Radon-Nikodym derivative of $\mu_2$ with respect to $\mu_1$ and is usually denoted
$(d\mu_2/d\mu_1)(s)$.
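When $\mu_1$ is counting measure on a finite set, the Radon-Nikodym derivative is simply the mass that $\mu_2$ puts on each point, and the change-of-measure identity becomes an identity between finite sums. The sketch below assumes a hypothetical four-point space, a hypothetical density `f`, and a hypothetical integrand `g`; none of these come from the text.

```python
from fractions import Fraction

points = [1, 2, 3, 4]
mu1 = {s: Fraction(1) for s in points}  # counting measure on the four points

# Hypothetical density f = d(mu2)/d(mu1); it may vanish at some points.
f = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(3), 4: Fraction(1, 4)}

def mu2(A):
    """mu2(A) = integral over A of f with respect to mu1 (a finite sum)."""
    return sum(f[s] * mu1[s] for s in A)

def g(s):
    """An arbitrary integrand (illustrative choice)."""
    return Fraction(s * s)

# Integral of g with respect to mu2, computed from the point masses of mu2 ...
lhs = sum(g(s) * mu2({s}) for s in points)
# ... and as the integral of g * f with respect to mu1.
rhs = sum(g(s) * f[s] * mu1[s] for s in points)
```

Here `lhs` and `rhs` coincide, and `mu2` assigns measure zero to the empty set, the only set with `mu1` measure zero, so $\mu_2 \ll \mu_1$ holds as required.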

A similar theorem, A.81, relates integrals with respect to measures on two
different spaces. It says that a function $f : S_1 \to S_2$ induces a measure on the
range $S_2$. If $\mu_1$ is a measure on $S_1$, then define $\mu_2(A) = \mu_1(f^{-1}(A))$. Integrals
with respect to $\mu_2$ can be written as integrals with respect to $\mu_1$ in the following
way: $\int g(y)\,d\mu_2(y) = \int g(f(x))\,d\mu_1(x)$. The measure $\mu_2$ is called the measure
induced on $S_2$ by $f$ from $\mu_1$.

A.2 Measures

A measure is a way of assigning numerical values to the "sizes" of sets. The
collection of sets whose sizes are given by a measure is a $\sigma$-field. (See Examples A.4
and A.5 on page 571.)

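Definition A.9. A nonempty collection of subsets $\mathcal{A}$ of a set $S$ is called a field
if

• $A \in \mathcal{A}$ implies$^{4}$ $A^c \in \mathcal{A}$,

• $A_1, A_2 \in \mathcal{A}$ implies $A_1 \cup A_2 \in \mathcal{A}$.

A field $\mathcal{A}$ is called a $\sigma$-field if $\{A_n\}_{n=1}^\infty \in \mathcal{A}$ implies $\bigcup_{i=1}^\infty A_i \in \mathcal{A}$.

Proposition A.10. Let $\aleph$ be an arbitrary set of indices, and let $\mathcal{Y} = \{\mathcal{A}_\alpha : \alpha \in \aleph\}$
be an arbitrary collection of $\sigma$-fields of subsets of a set $S$. Then $\bigcap_{\alpha\in\aleph} \mathcal{A}_\alpha$ is
also a $\sigma$-field of subsets of $S$.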

Because of Proposition A.10 and the fact that $2^S$ is a $\sigma$-field, it is easy to
see that, for every collection of subsets $\mathcal{C}$ of $S$, there is a smallest $\sigma$-field $\mathcal{A}$ that
contains $\mathcal{C}$, namely the intersection of all $\sigma$-fields that contain $\mathcal{C}$.
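For a finite set $S$, the smallest $\sigma$-field containing a collection can be computed by brute force, which gives a concrete feel for the definition. The sketch below does not intersect all $\sigma$-fields containing the collection; instead it closes the collection under complements and pairwise unions until nothing new appears, which yields the same result when $S$ is finite. The function name `generated_sigma_field` is a hypothetical choice.

```python
from itertools import combinations

def generated_sigma_field(space, collection):
    """Smallest sigma-field on the finite set `space` containing `collection`.

    On a finite set, closure under complements and finite unions is enough,
    so we iterate those two operations until a fixed point is reached.
    Every member of `collection` is assumed to be a subset of `space`.
    """
    space = frozenset(space)
    field = {frozenset(), space} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for a in current:
            complement = space - a
            if complement not in field:
                field.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in field:
                field.add(union)
                changed = True
    return field

# The sigma-field on {1, 2, 3, 4} generated by {1} and {1, 2}.
sigma = generated_sigma_field({1, 2, 3, 4}, [{1}, {1, 2}])
```

In this example the generated $\sigma$-field has the atoms $\{1\}$, $\{2\}$, and $\{3,4\}$, so it contains $2^3 = 8$ sets; the points 3 and 4 cannot be separated because no generating set distinguishes them.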

Definition A.11. Let $\mathcal{C}$ be the collection of intervals in $\mathbb{R}$. The smallest $\sigma$-field
containing $\mathcal{C}$ is called the Borel $\sigma$-field. In general, if $S$ is a topological space, and
$\mathcal{B}$ is the smallest $\sigma$-field that contains all of the open sets, then $\mathcal{B}$ is called the
Borel $\sigma$-field.



4 The symbol $A^c$ stands for the complement of the set $A$.






In addition to the Borel $\sigma$-field, the product $\sigma$-field is also generated by a simple
collection of sets.

Definition A.12.

• Let $\aleph$ be an index set, and let $\{S_\alpha\}_{\alpha\in\aleph}$ be a collection of sets. Define
$S = \prod_{\alpha\in\aleph} S_\alpha$. We call $S$ a product space.

• For each $\alpha \in \aleph$, let $\mathcal{A}_\alpha$ be a $\sigma$-field of subsets of $S_\alpha$. Define the product
$\sigma$-field as follows. $\bigotimes_{\alpha\in\aleph} \mathcal{A}_\alpha$ is the smallest $\sigma$-field that contains all sets of
the form $\prod_{\alpha\in\aleph} A_\alpha$, where $A_\alpha \in \mathcal{A}_\alpha$ for all $\alpha$ and all but finitely many $A_\alpha$
are equal to $S_\alpha$.

In the special case in which $\aleph = \{1, 2\}$, we use the notation $S = S_1 \times S_2$ and the
product $\sigma$-field is denoted $\mathcal{A}_1 \otimes \mathcal{A}_2$.

Proposition A.13.$^{5}$ The Borel $\sigma$-field $\mathcal{B}^k$ of $\mathbb{R}^k$ is the same as the product
$\sigma$-field of $k$ copies of $(\mathbb{R}, \mathcal{B}^1)$.

There are other types of collections of sets that are related to $\sigma$-fields.
Sometimes it is easier to prove results about these other collections and then use the
theorems that follow to infer similar results about $\sigma$-fields.

Definition A.14. Let $S$ be a set. A collection $\Pi$ of subsets of $S$ is called a
$\pi$-system if $A, B \in \Pi$ implies $A \cap B \in \Pi$. A collection $\Lambda$ is called a $\lambda$-system if
$S \in \Lambda$, $A \in \Lambda$ implies $A^c \in \Lambda$, and $\{A_n\}_{n=1}^\infty \in \Lambda$ with $A_i \cap A_j = \emptyset$ for $i \ne j$
implies $\bigcup_{i=1}^\infty A_i \in \Lambda$.

As in Proposition A.10, the intersection of arbitrarily many $\pi$-systems is a
$\pi$-system, and so too with $\lambda$-systems. The following propositions are also easy to
prove.

Proposition A.15. If $S$ is a set and $\mathcal{C}$ is a collection of subsets of $S$ such that
$\mathcal{C}$ is a $\pi$-system and a $\lambda$-system, then $\mathcal{C}$ is a $\sigma$-field.

Proposition A.16. If $S$ is a set and $\Lambda$ is a $\lambda$-system of subsets, then $A, A \cap B \in \Lambda$
implies $A \cap B^c \in \Lambda$.

The following lemma is the key to a useful uniqueness theorem. 

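Lemma A.17 ($\pi$-$\lambda$ theorem).$^{6}$ Suppose that $\Pi$ is a $\pi$-system, that $\Lambda$ is a
$\lambda$-system, and that $\Pi \subseteq \Lambda$. Then the smallest $\sigma$-field containing $\Pi$ is contained in
$\Lambda$.

PROOF. Define $\lambda(\Pi)$ to be the smallest $\lambda$-system containing $\Pi$, and define $\sigma(\Pi)$
to be the smallest $\sigma$-field containing $\Pi$. For each $A \subseteq S$, define $\mathcal{G}_A$ to be the
collection of all sets $B \subseteq S$ such that $A \cap B \in \lambda(\Pi)$.

First, we show that $\mathcal{G}_A$ is a $\lambda$-system for each $A \in \lambda(\Pi)$. To see this, note that
$A \cap S \in \lambda(\Pi)$, so $S \in \mathcal{G}_A$. If $B \in \mathcal{G}_A$, then $A \cap B \in \lambda(\Pi)$, and Proposition A.16
says that $A \cap B^c \in \lambda(\Pi)$, so $B^c \in \mathcal{G}_A$. Finally, $\{B_n\}_{n=1}^\infty \in \mathcal{G}_A$ with the $B_n$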



5 This proposition is used in the proof of Theorem A.38.
6 This lemma is used in the proofs of Theorems A.26 and B.46 and
Lemma A.61.






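disjoint implies that $A \cap B_n \in \lambda(\Pi)$ with $A \cap B_n$ disjoint, so their union is in
$\lambda(\Pi)$. But their union is $A \cap (\bigcup_{n=1}^\infty B_n)$. So $\bigcup_{n=1}^\infty B_n \in \mathcal{G}_A$.

Next, we show that $\lambda(\Pi) \subseteq \mathcal{G}_C$ for every $C \in \lambda(\Pi)$. Let $A, B \in \Pi$, and notice
that $A \cap B \in \Pi$, so $B \in \mathcal{G}_A$. Since $\mathcal{G}_A$ is a $\lambda$-system containing $\Pi$, it must contain
$\lambda(\Pi)$. It follows that $A \cap C \in \lambda(\Pi)$ for all $C \in \lambda(\Pi)$. If $C \in \lambda(\Pi)$, it then follows
that $A \in \mathcal{G}_C$. So, $\Pi \subseteq \mathcal{G}_C$ for all $C \in \lambda(\Pi)$. Since $\mathcal{G}_C$ is a $\lambda$-system containing $\Pi$,
it must contain $\lambda(\Pi)$.

Finally, if $A, B \in \lambda(\Pi)$, we just proved that $B \in \mathcal{G}_A$, so $A \cap B \in \lambda(\Pi)$ and
hence $\lambda(\Pi)$ is also a $\pi$-system. By Proposition A.15, $\lambda(\Pi)$ is a $\sigma$-field containing
$\Pi$ and hence must contain $\sigma(\Pi)$. Since $\lambda(\Pi) \subseteq \Lambda$, the proof is complete. □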

We are now in a position to give a precise definition of measure. 

Definition A.18. 

• A pair $(S, \mathcal{A})$, where $S$ is a set and $\mathcal{A}$ is a $\sigma$-field, is called a measurable
space.

• A function $\mu : \mathcal{A} \to [0, \infty]$ is called a measure if

- $\mu(\emptyset) = 0$,

- $\{A_n\}_{n=1}^\infty$ mutually disjoint implies $\mu(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mu(A_i)$.

• A function $\mu : \mathcal{A} \to [-\infty, \infty]$ that satisfies the above two conditions and
does not assume both of the values $\infty$ and $-\infty$ is called a signed measure.$^{7}$

• If $\mu$ is a measure, the triple $(S, \mathcal{A}, \mu)$ is called a measure space.

• If $(S, \mathcal{A}, \mu)$ is a measure space and $\mu(S) = 1$, then $\mu$ is called a probability
and $(S, \mathcal{A}, \mu)$ is called a probability space.

Some examples of measures were given in Section A.1.

Theorem A.19.$^{8}$ If $(S, \mathcal{A}, \mu)$ is a measure space and $\{A_n\}_{n=1}^\infty$ is a monotone
sequence,$^{9}$ then $\mu(\lim_{i\to\infty} A_i) = \lim_{i\to\infty} \mu(A_i)$ if either of the following holds:

• the sequence is increasing,

• the sequence is decreasing and $\mu(A_1) < \infty$.

Proof. If the sequence is increasing, then let $B_1 = A_1$ and $B_k = A_k \setminus A_{k-1}$ for
$k > 1$.$^{10}$ Then $\{B_n\}_{n=1}^\infty$ are disjoint and the following are true:

$$\bigcup_{i=1}^k B_i = A_k, \qquad \bigcup_{i=1}^\infty B_i = \lim_{k\to\infty} A_k, \qquad \mu(A_k) = \sum_{i=1}^k \mu(B_i).$$



7 Signed measures will only be used in Section A.6.
8 This theorem is used in the proofs of Theorems A.50 and B.90 and
Lemma A.72.

9 A sequence of sets $\{A_n\}_{n=1}^\infty$ is monotone if either $A_1 \subseteq A_2 \subseteq \cdots$ or
$A_1 \supseteq A_2 \supseteq \cdots$. In the first case, we say that the sequence is increasing and
$\lim_{n\to\infty} A_n = \bigcup_{i=1}^\infty A_i$. In the second case, we say that the sequence is decreasing
and $\lim_{n\to\infty} A_n = \bigcap_{i=1}^\infty A_i$.

10 The symbol $A \setminus B$ is another way of saying $A \cap B^c$.






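Hence,

$$\lim_{k\to\infty} \mu(A_k) = \sum_{i=1}^\infty \mu(B_i) = \mu\left(\bigcup_{i=1}^\infty B_i\right) = \mu\left(\lim_{k\to\infty} A_k\right).$$

If the sequence is decreasing, then let $B_i = A_i \setminus A_{i+1}$, for $i = 1, 2, \ldots$. It follows
that

$$A_1 = \left(\lim_{k\to\infty} A_k\right) \cup \left(\bigcup_{i=1}^\infty B_i\right),$$

and all of the sets on the right-hand side are disjoint. It follows that

$$A_k = A_1 \setminus \bigcup_{i=1}^{k-1} B_i,$$

$$\mu(A_k) = \mu(A_1) - \sum_{i=1}^{k-1} \mu(B_i),$$

$$\lim_{k\to\infty} \mu(A_k) = \mu(A_1) - \sum_{i=1}^\infty \mu(B_i) = \mu\left(\lim_{k\to\infty} A_k\right). \qquad □$$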



Another useful theorem concerning sequences of sets is the following. 

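Theorem A.20 (First Borel-Cantelli lemma).$^{11}$ If $\sum_{n=1}^\infty \mu(A_n) < \infty$,
then $\mu\left(\bigcap_{i=1}^\infty \bigcup_{n=i}^\infty A_n\right) = 0$.

PROOF. Let $B_i = \bigcup_{n=i}^\infty A_n$ and $B = \bigcap_{i=1}^\infty B_i$. Since $B \subseteq B_i$ for each $i$, it
follows that $\mu(B) \le \mu(B_i)$ for all $i$. Since $\mu(B_i) \le \sum_{n=i}^\infty \mu(A_n)$, it follows that
$\lim_{i\to\infty} \mu(B_i) = 0$. Hence $\mu(B) = 0$. □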
Theorem A.22 below is used in several places for extending measures defined
on a field to the smallest $\sigma$-field containing the field. A definition is required first.

Definition A.21. Let $S$ be a set, $\mathcal{A}$ a collection of subsets of $S$, and
$\mu : \mathcal{A} \to \mathbb{R} \cup \{\pm\infty\}$ a set function. Suppose that $S = \bigcup_{i=1}^\infty A_i$ with $\mu(A_i) < \infty$ for each $i$.
Then we say $\mu$ is $\sigma$-finite. If $\mu$ is a $\sigma$-finite measure on $(S, \mathcal{A})$, then $(S, \mathcal{A}, \mu)$ is
called a $\sigma$-finite measure space.

The proof of Theorem A.22 is adapted from Royden (1968).

Theorem A.22 (Carathéodory extension theorem).$^{12}$ Let $\mu$ be a set function
defined on a field $\mathcal{C}$ of subsets of a set $S$ that is $\sigma$-finite, nonnegative, extended



11 This theorem is used in the proofs of Lemma A.72 and Theorems B.90
and 1.61. There is a second Borel-Cantelli lemma, which involves probability
measures, but we will not use it in this text. See Problem 20 on page 663. The
set whose measure is the subject of this theorem is sometimes called "$A_n$ infinitely
often" because it is the set of points that are in infinitely many of the $A_n$.

12 This theorem is used to prove the existence of many common measures
(including product measure) and in the proofs of Lemma A.24 and of
Theorems B.118, B.131, and B.133.






real-valued, and countably additive and satisfies $\mu(\emptyset) = 0$. Then there is a unique
extension of $\mu$ to a measure on a measure space$^{13}$ $(S, \mathcal{A}, \mu^*)$. (That is, $\mathcal{C} \subseteq \mathcal{A}$
and $\mu(A) = \mu^*(A)$ for all $A \in \mathcal{C}$.)

PROOF. The proof will proceed as follows. First, we will define $\mu^*$ and $\mathcal{A}$. Then we
will show that $\mu^*$ is monotone and subadditive, that $\mathcal{C} \subseteq \mathcal{A}$, that $\mathcal{A}$ is a $\sigma$-field,
that $\mu^*$ is countably additive on $\mathcal{A}$, that $\mu^*$ extends $\mu$, and finally that $\mu^*$ is the
unique extension.

For each $B \in 2^S$, define

$$\mu^*(B) = \inf \sum_{i=1}^\infty \mu(A_i), \qquad \text{(A.23)}$$

where the inf is taken over all $\{A_i\}_{i=1}^\infty$ such that $B \subseteq \bigcup_{i=1}^\infty A_i$ and $A_i \in \mathcal{C}$ for all
$i$. Let

$$\mathcal{A} = \{B \in 2^S : \mu^*(C) = \mu^*(C \cap B) + \mu^*(C \cap B^c), \text{ for all } C \in 2^S\}.$$

First, we show that $\mu^*$ is monotone and subadditive. Clearly, $\mu^*(A) \le \mu(A)$
for all $A \in \mathcal{C}$ and $B_1 \subseteq B_2$ implies $\mu^*(B_1) \le \mu^*(B_2)$. It is also easy to see that
$\mu^*(B_1 \cup B_2) \le \mu^*(B_1) + \mu^*(B_2)$ for all $B_1, B_2 \in 2^S$. In fact, if $\{B_n\}_{n=1}^\infty \in 2^S$,
then $\mu^*(\bigcup_{i=1}^\infty B_i) \le \sum_{i=1}^\infty \mu^*(B_i)$. The proof is to notice that the collection of
numbers whose inf is $\mu^*$ of the union includes all of the sums of the numbers
whose infima are the $\mu^*$ values being added together.

Next, we show that $\mathcal{C} \subseteq \mathcal{A}$. Let $A \in \mathcal{C}$ and $C \in 2^S$. Since $\mu^*$ is subadditive, we
only need to show that $\mu^*(C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^c)$. If $\mu^*(C) = \infty$, this is
clearly true. So let $\mu^*(C) < \infty$. From the definition of $\mu^*$, for every $\epsilon > 0$, there
exists a collection $\{A_i\}_{i=1}^\infty$ of elements of $\mathcal{C}$ such that $\sum_{i=1}^\infty \mu(A_i) < \mu^*(C) + \epsilon$.
Since $\mu(A_i) = \mu(A_i \cap A) + \mu(A_i \cap A^c)$ for every $i$, we have

$$\mu^*(C) + \epsilon > \sum_{i=1}^\infty \mu(A_i \cap A) + \sum_{i=1}^\infty \mu(A_i \cap A^c) \ge \mu^*(C \cap A) + \mu^*(C \cap A^c).$$

Since this is true for every $\epsilon > 0$, it must be that $\mu^*(C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^c)$,
hence $A \in \mathcal{A}$.

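Next, we show that $\mathcal{A}$ is a $\sigma$-field. It is clear that $\emptyset \in \mathcal{A}$ and $A \in \mathcal{A}$ implies
$A^c \in \mathcal{A}$ by the symmetry in the definition of $\mathcal{A}$. Let $A_1, A_2 \in \mathcal{A}$ and $C \in 2^S$. We
can write

$$\begin{aligned}
\mu^*(C) &= \mu^*(C \cap A_1) + \mu^*(C \cap A_1^c) \\
&= \mu^*(C \cap A_1) + \mu^*(C \cap A_1^c \cap A_2) + \mu^*(C \cap A_1^c \cap A_2^c) \\
&\ge \mu^*(C \cap [A_1 \cup A_2]) + \mu^*(C \cap [A_1 \cup A_2]^c),
\end{aligned}$$

where the first two equalities follow from $A_1, A_2 \in \mathcal{A}$, and the last follows from
the subadditivity of $\mu^*$. So, $A_1 \cup A_2 \in \mathcal{A}$. Let $\{A_n\}_{n=1}^\infty \in \mathcal{A}$; then we can write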



13 The usual statement of this theorem includes the additional claim that the
measure space $(S, \mathcal{A}, \mu^*)$ is complete. A measure space is complete if every subset
of every set with measure 0 is in the $\sigma$-field.






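$A = \bigcup_{i=1}^\infty A_i = \bigcup_{i=1}^\infty B_i$, where each $B_i \in \mathcal{A}$ and the $B_i$ are disjoint. (This just
makes use of complements and finite unions of elements of $\mathcal{A}$ being in $\mathcal{A}$.) Let
$D_n = \bigcup_{i=1}^n B_i$ and $C \in 2^S$. Since $A^c \subseteq D_n^c$ and $D_n \in \mathcal{A}$ for each $n$, we have

$$\begin{aligned}
\mu^*(C) &= \mu^*(C \cap D_n) + \mu^*(C \cap D_n^c) \\
&\ge \mu^*(C \cap D_n) + \mu^*(C \cap A^c) \\
&= \sum_{i=1}^n \mu^*(C \cap B_i) + \mu^*(C \cap A^c).
\end{aligned}$$

Since this is true for every $n$,

$$\mu^*(C) \ge \sum_{i=1}^\infty \mu^*(C \cap B_i) + \mu^*(C \cap A^c) \ge \mu^*(C \cap A) + \mu^*(C \cap A^c),$$

where the last inequality follows from subadditivity. So, $\mathcal{A}$ is a $\sigma$-field.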

Next, we show that $\mu^*$ is countably additive when restricted to $\mathcal{A}$. If $A_1, A_2$
are disjoint elements of $\mathcal{A}$, then $A_1 = (A_1 \cup A_2) \cap A_1$ and $A_2 = (A_1 \cup A_2) \cap A_1^c$.
It follows that

$$\mu^*(A_1 \cup A_2) = \mu^*(A_1) + \mu^*(A_2).$$

By induction, $\mu^*$ is finitely additive on $\mathcal{A}$. Let $A = \bigcup_{i=1}^\infty A_i$, where each $A_i \in \mathcal{A}$
and the $A_i$ are disjoint. Since $\bigcup_{i=1}^n A_i \subseteq A$, we have, for every $n$, $\mu^*(A) \ge
\sum_{i=1}^n \mu^*(A_i)$, which implies $\mu^*(A) \ge \sum_{i=1}^\infty \mu^*(A_i)$. By subadditivity, we get the
reverse inequality, hence $\mu^*$ is countably additive on $\mathcal{A}$.

Next, we prove that $\mu^*$ extends $\mu$. Since $\mu^*$ is countably additive on $\mathcal{A}$, we
can let $B \in \mathcal{C}$ and $\{A_n\}_{n=1}^\infty \in \mathcal{C}$ be disjoint, such that $B \subseteq \bigcup_{n=1}^\infty A_n$. Then
B C U~ i(A n fl 5) = B, and < £~ 1 /x(A n flB) = 11(B) % since M is 

countably additive on C. 

To prove uniqueness, suppose that $\mu'$ also extends $\mu$ to $\mathcal{A}$. Then $\mu'(B) \le
\sum_{n=1}^\infty \mu(A_n)$ if $B \subseteq \bigcup_{n=1}^\infty A_n$. Hence, $\mu'(B) \le \mu^*(B)$ for all $B \in \mathcal{A}$. If there exists
$B$ such that $\mu'(B) < \mu^*(B)$, let $\{A_n\}_{n=1}^\infty \in \mathcal{C}$ be disjoint and such that $\mu(A_n) <
\infty$ and $\bigcup_{n=1}^\infty A_n = S$. Then, there exists $n$ such that $\mu'(B \cap A_n) < \mu^*(B \cap A_n)$.
Since $\mu'(A_n) = \mu^*(A_n)$, it must be that $\mu'(B^c \cap A_n) > \mu^*(B^c \cap A_n)$, but this is
a contradiction. □
Here are some examples: 

• Let $S = \mathbb{R}$ and let $\mathcal{B}$ be the Borel $\sigma$-field. Define $\mu((a, b]) = b - a$ for
intervals, and extend $\mu$ to finite unions of disjoint intervals by addition.
Theorem A.22 will extend $\mu$ to the $\sigma$-field $\mathcal{B}$. This measure is called Lebesgue
measure on the real line.

• Let $F$ be any monotone increasing function on $\mathbb{R}$ which is continuous from
the right. Let $S = \mathbb{R}$ and let $\mathcal{B}$ be the Borel $\sigma$-field. Define $\mu((a, b]) =
F(b) - F(a)$. This can be extended to all of $\mathcal{B}$. In particular, if $F$ is a CDF,
then $\mu$ is a probability.

In the examples above, the claim was made that $\mu$ could be extended to the
Borel $\sigma$-field. To do this by way of the Carathéodory extension theorem A.22, we
need $\mu$ to be defined on a field, countably additive, and $\sigma$-finite. For the cases
described above, this can be arranged as follows. Suppose that $\mu$ is defined on






intervals of the form $(a, b]$ with $a = -\infty$ and/or $b = \infty$ possible.$^{14}$ The collection
$\mathcal{C}$ of all unions of finitely many disjoint intervals of this form is easily seen to be
a field. If $(a_1, b_1], \ldots, (a_n, b_n]$ are mutually disjoint, set

$$\mu\left(\bigcup_{i=1}^n (a_i, b_i]\right) = \sum_{i=1}^n \mu((a_i, b_i]).$$

It is not hard to see that this extension of $\mu$ to $\mathcal{C}$ is well defined. This means that if
$\bigcup_{i=1}^n (a_i, b_i] = \bigcup_{i=1}^m (c_i, d_i]$, where $(c_1, d_1], \ldots, (c_m, d_m]$ are also mutually disjoint,
then $\sum_{i=1}^n \mu((a_i, b_i]) = \sum_{i=1}^m \mu((c_i, d_i])$. If $\mu$ is finite for every interval, then it is
$\sigma$-finite. To see that $\mu$ is countably additive on $\mathcal{C}$, suppose that $\mu((a, b]) = F(b) -
F(a)$, where $F$ is nondecreasing and continuous from the right. If $\{(a_n, b_n]\}_{n=1}^\infty$ is
a sequence of disjoint intervals and $(a, b]$ is an interval such that $\bigcup_{n=1}^\infty (a_n, b_n] \subseteq
(a, b]$, then it is not difficult to see that $\sum_{n=1}^\infty \mu((a_n, b_n]) \le \mu((a, b])$. If $(a, b] \subseteq
\bigcup_{n=1}^\infty (a_n, b_n]$, we can also prove that $\sum_{n=1}^\infty \mu((a_n, b_n]) \ge \mu((a, b])$ (see Problem 7
on page 603). Together these facts will imply that $\mu$ is countably additive on $\mathcal{C}$.

The proof of Theorem A.22 leads us to the following useful result. Its proof is 
adapted from Halmos (1950). 

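Lemma A.24.$^{15}$ Let $(S, \mathcal{A}, \mu)$ be a $\sigma$-finite measure space. Suppose that $\mathcal{C}$ is a
field such that $\mathcal{A}$ is the smallest $\sigma$-field containing $\mathcal{C}$. Then, for every $A \in \mathcal{A}$ and
$\epsilon > 0$, there is $C \in \mathcal{C}$ such that $\mu(C \Delta A) < \epsilon$.$^{16}$

PROOF. Clearly, $\mu$ and $\mathcal{C}$ satisfy the conditions of Theorem A.22, so that $\mu$ is equal
to the $\mu^*$ in the proof of that theorem. Let $A \in \mathcal{A}$ and $\epsilon > 0$ be given. It follows
from (A.23) that there exists a sequence $\{A_i\}_{i=1}^\infty$ in $\mathcal{C}$ such that $A \subseteq \bigcup_{i=1}^\infty A_i$ and

$$\mu(A) > \sum_{i=1}^\infty \mu(A_i) - \frac{\epsilon}{2}.$$

Since $\mu$ is countably additive,

$$\lim_{n\to\infty} \mu\left(\bigcup_{i=1}^n A_i\right) = \mu\left(\bigcup_{i=1}^\infty A_i\right),$$

so that there exists $n$ such that

$$\mu\left(\bigcup_{i=1}^\infty A_i\right) < \mu\left(\bigcup_{i=1}^n A_i\right) + \frac{\epsilon}{2}.$$

Let $C = \bigcup_{i=1}^n A_i$, which is clearly in $\mathcal{C}$. Now

$$\mu(A \cap C^c) \le \mu\left(\bigcup_{i=1}^\infty A_i \cap C^c\right) = \mu\left(\bigcup_{i=1}^\infty A_i\right) - \mu(C) < \frac{\epsilon}{2}.$$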



14 If $b = \infty$, we mean $(a, \infty)$ by $(a, b]$. That is, we do not intend $\infty$ to be a
point in the space $S$.

15 This lemma is used in the proof of the Kolmogorov zero-one law B.68.

16 The symbol $\Delta$ here refers to the symmetric difference operator on pairs of
sets. We define $C \Delta A$ to be $(C \cap A^c) \cup (C^c \cap A)$.






Similarly,

$$\mu(C \cap A^c) \le \mu\left(\bigcup_{i=1}^\infty A_i \cap A^c\right) = \mu\left(\bigcup_{i=1}^\infty A_i\right) - \mu(A) < \frac{\epsilon}{2}.$$

It now follows that $\mu(A \Delta C) < \epsilon$. □

Sets with measure zero are ubiquitous in measure theory, so there is a special
definition that allows us to refer to them more easily.

Definition A.25. Let $E$ be some statement concerning the points in $S$ such that
for each point $s \in S$, $E$ is either true or false but not both. Suppose that there
exists a set $A \in \mathcal{A}$ such that $\mu(A) = 0$ and that for all $s \in A^c$, $E$ is true. Then
we say that $E$ is true almost everywhere with respect to $\mu$, written a.e. $[\mu]$. If $\mu$
is a probability, then almost everywhere is often expressed as almost surely and
denoted a.s. $[\mu]$.

The following theorem implies uniqueness of measures with certain properties. 

Theorem A.26.$^{17}$ Suppose that $\mu_1$ and $\mu_2$ are measures on $(S, \mathcal{A})$ and $\mathcal{A}$ is the
smallest $\sigma$-field containing the $\pi$-system $\Pi$. If $\mu_1$ and $\mu_2$ are both $\sigma$-finite on $\Pi$
and they agree on $\Pi$, then they agree on $\mathcal{A}$.

Proof. First, let $C \in \Pi$ be such that $\mu_1(C) = \mu_2(C) < \infty$, and define $\mathcal{G}_C$ to
be the collection of all $B \in \mathcal{A}$ such that $\mu_1(B \cap C) = \mu_2(B \cap C)$. Using simple
properties of measures, we see that $\mathcal{G}_C$ is a $\lambda$-system that contains $\Pi$, hence it
equals $\mathcal{A}$ by Lemma A.17. (For example, if $B \in \mathcal{G}_C$,

$$\mu_1(B^c \cap C) = \mu_1(C) - \mu_1(B \cap C) = \mu_2(C) - \mu_2(B \cap C) = \mu_2(B^c \cap C),$$

so $B^c \in \mathcal{G}_C$.)

Next, if $\mu_1$ and $\mu_2$ are not finite, there exists a sequence $\{C_n\}_{n=1}^\infty \in \Pi$ such
that $\mu_1(C_n) = \mu_2(C_n) < \infty$, and $S = \bigcup_{n=1}^\infty C_n$. (Since $\Pi$ is only a $\pi$-system, we
cannot assume that the $C_n$ are disjoint.) For each $A \in \mathcal{A}$,

$$\mu_j(A) = \lim_{n\to\infty} \mu_j\left(\bigcup_{i=1}^n [C_i \cap A]\right) \quad \text{for } j = 1, 2.$$

Since $\mu_j(\bigcup_{i=1}^n [C_i \cap A])$ can be written as a linear combination of values of $\mu_j$
at sets of the form $A \cap C$, where $C \in \Pi$ is the intersection of finitely many of
$C_1, \ldots, C_n$, it follows from $A \in \mathcal{G}_C$ that $\mu_1(\bigcup_{i=1}^n [C_i \cap A]) = \mu_2(\bigcup_{i=1}^n [C_i \cap A])$
for all $n$, hence $\mu_1(A) = \mu_2(A)$.



A.3 Measurable Functions 

There are certain types of functions with which we will be primarily concerned. 



17 This theorem is used in the proofs of Theorems B.32, B.46, B.118, B.131, 
and 1.115, Lemma A.64, and Corollary B.44. 






Definition A.27. Suppose that $S$ is a set with a $\sigma$-field $\mathcal{A}$ of subsets, and let $T$
be another set with a $\sigma$-field $\mathcal{C}$ of subsets. Suppose that $f : S \to T$ is a function.
We say $f$ is measurable if for every $B \in \mathcal{C}$, $f^{-1}(B) \in \mathcal{A}$. If $f$ is measurable,
one-to-one, and onto and $f^{-1}$ is measurable, we say that $f$ is bimeasurable. If
$T = \mathbb{R}$, the real numbers, and $\mathcal{C} = \mathcal{B}$, the Borel $\sigma$-field, then if $f$ is measurable,
we say that $f$ is Borel measurable.

Proposition A.28. Suppose that $(S, \mathcal{A})$ and $(T, \mathcal{C})$ are measurable spaces.
Suppose that $f : S \to T$ is a function.

• If $\mathcal{A} = 2^S$, then $f$ is measurable.

• If $\mathcal{C} = \{T, \emptyset\}$, then $f$ is measurable.

• If $\mathcal{A} = \{S, \emptyset\}$, $\{y\} \in \mathcal{C}$ for every $y \in T$, and $f$ is measurable, then $f$ is
constant.

As examples, if $S = T = \mathbb{R}$, and $\mathcal{A} = \mathcal{B}$ is the Borel $\sigma$-field, then all continuous
functions are measurable. But many discontinuous functions are also measurable. 
For example, step functions are measurable. All monotone functions are measur- 
able. In fact, it is very difficult to describe a nonmeasurable function without 
using some heavy mathematics. 

The following theorems make it easier to show that a function is measurable. 

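Theorem A.29.$^{18}$ Let $\aleph$, $S$, and $T$ be arbitrary sets. Let $\{A_\alpha : \alpha \in \aleph\}$ be a
collection of subsets of $T$, and let $A$ be an arbitrary subset of $T$. Let $f : S \to T$
be a function. Then

$$f^{-1}\left(\bigcup_{\alpha\in\aleph} A_\alpha\right) = \bigcup_{\alpha\in\aleph} f^{-1}(A_\alpha), \qquad
f^{-1}\left(\bigcap_{\alpha\in\aleph} A_\alpha\right) = \bigcap_{\alpha\in\aleph} f^{-1}(A_\alpha), \qquad
f^{-1}(A^c) = f^{-1}(A)^c.$$

PROOF. For the union, if $s \in f^{-1}(\bigcup_{\alpha\in\aleph} A_\alpha)$, then $f(s) \in \bigcup_{\alpha\in\aleph} A_\alpha$, hence there
exists $\alpha$ such that $f(s) \in A_\alpha$, so $s \in f^{-1}(A_\alpha)$ and $s \in \bigcup_{\alpha\in\aleph} f^{-1}(A_\alpha)$. If $s \in
\bigcup_{\alpha\in\aleph} f^{-1}(A_\alpha)$, then there exists $\alpha$ such that $s \in f^{-1}(A_\alpha)$, hence $f(s) \in A_\alpha$,
hence $f(s) \in \bigcup_{\alpha\in\aleph} A_\alpha$, hence $s \in f^{-1}(\bigcup_{\alpha\in\aleph} A_\alpha)$. This proves the first equality.
The second is almost identical in that "there exists $\alpha$" is merely replaced by "for
all $\alpha$" in the above proof. For the complement, if $s \in f^{-1}(A^c)$, then $f(s) \in A^c$
and $f(s) \notin A$. Hence, $s \notin f^{-1}(A)$ and $s \in f^{-1}(A)^c$. If $s \in f^{-1}(A)^c$, then
$s \notin f^{-1}(A)$ and $f(s) \notin A$. So, $f(s) \in A^c$ and $s \in f^{-1}(A^c)$. □

Corollary A.30.$^{19}$ If $S$ and $T$ are sets and $\mathcal{C}$ is a $\sigma$-field of subsets of $T$ and
$f : S \to T$ is a function, then $f^{-1}(\mathcal{C})$ is a $\sigma$-field of subsets of $S$. In fact, it is the
smallest $\sigma$-field of subsets of $S$ such that $f$ is measurable.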
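The corollary can be checked by hand on a small finite example. The sketch below assumes a hypothetical function $f(s) = s \bmod 3$ from $S = \{0, \ldots, 5\}$ to $T = \{0, 1, 2\}$, with $\mathcal{C} = 2^T$; the helper names `powerset` and `preimage` are illustrative.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

def preimage(f, domain, target_set):
    """The inverse image f^{-1}(target_set) within `domain`."""
    return frozenset(s for s in domain if f(s) in target_set)

S = frozenset(range(6))
T = frozenset(range(3))

def f(s):
    return s % 3  # hypothetical function S -> T

sigma_T = powerset(T)                                # the sigma-field 2^T on T
generated = {preimage(f, S, B) for B in sigma_T}     # f^{-1}(C)
```

The collection `generated` has eight members, each a union of the level sets $\{0,3\}$, $\{1,4\}$, $\{2,5\}$ of $f$. It is closed under complements and unions, as the corollary asserts, and it does not contain $\{0\}$, because $f$ cannot distinguish 0 from 3.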



18 This theorem is used in the proof of Theorem A.34.

19 This corollary is used in the proof of Theorem A.42, and it is used to define
the $\sigma$-field generated by a function.


Definition A.31. The $\sigma$-field $f^{-1}(\mathcal{C})$ in Corollary A.30 is called the $\sigma$-field
generated by $f$.

A measurable function also generates a $\sigma$-field of subsets of its image.

Proposition A.32. Let $(T, \mathcal{C})$ be a measurable space. Let $U \subseteq T$ be arbitrary
(possibly not even in $\mathcal{C}$). Define $\mathcal{C}^* = \{U \cap B : B \in \mathcal{C}\}$. Then $\mathcal{C}^*$ is a $\sigma$-field of
subsets of $U$.

Definition A.33. The $\sigma$-field $\mathcal{C}^*$ in Proposition A.32 is called the restriction of
the $\sigma$-field $\mathcal{C}$ to $U$. If $f : S \to T$ and $U = f(S)$, then $\mathcal{C}^*$ is called the image $\sigma$-field
of $f$.

Theorem A.34.$^{20}$ Let $(S, \mathcal{A})$ be a measurable space and let $f : S \to T$ be a
function. Let $\mathcal{C}^*$ be a nonempty collection of subsets of $T$, and let $\mathcal{C}$ be the smallest
$\sigma$-field that contains $\mathcal{C}^*$. If $f^{-1}(\mathcal{C}^*) \subseteq \mathcal{A}$, then $f^{-1}(\mathcal{C}) \subseteq \mathcal{A}$.

Proof. Let $\mathcal{C}_2$ be the collection of all subsets $B$ of $T$ such that $f^{-1}(B) \in \mathcal{A}$. By
assumption, $\mathcal{C}^* \subseteq \mathcal{C}_2$. We will now prove that $\mathcal{C}_2$ is a $\sigma$-field; hence it must contain
$\mathcal{C}$, which implies the conclusion of the theorem. Clearly, $\mathcal{C}_2$ is nonempty, since $\mathcal{C}^*$
is nonempty. Let $A \in \mathcal{C}_2$. Theorem A.29 implies $f^{-1}(A^c) = f^{-1}(A)^c \in \mathcal{A}$, since
$\mathcal{A}$ is a $\sigma$-field. This means that $A^c \in \mathcal{C}_2$. Let $A_1, A_2, \ldots \in \mathcal{C}_2$. Then Theorem A.29
implies

$$f^{-1}\left(\bigcup_{i=1}^\infty A_i\right) = \bigcup_{i=1}^\infty f^{-1}(A_i) \in \mathcal{A},$$

since $\mathcal{A}$ is a $\sigma$-field. So $\mathcal{C}_2$ is a $\sigma$-field. □

To use this theorem to show that a function $f : S \to T$ is measurable when $T$
has a $\sigma$-field of subsets $\mathcal{C}$, we can find a smaller collection of subsets $\mathcal{C}^*$ such that $\mathcal{C}$
is the smallest $\sigma$-field containing $\mathcal{C}^*$ and prove that $f^{-1}(\mathcal{C}^*) \subseteq \mathcal{A}$. Theorem A.34
would then imply $f^{-1}(\mathcal{C}) \subseteq \mathcal{A}$ and $f$ is measurable. As an example, consider the
next lemma.

Lemma A.35.$^{21}$ Let $(S, \mathcal{A})$ be a measurable space, and let $f : S \to \mathbb{R}$ be a
function. Then $f$ is measurable if and only if $f^{-1}((b, \infty)) \in \mathcal{A}$ for all $b \in \mathbb{R}$.

PROOF. The "only if" part is trivial. For the "if" part, let $\mathcal{C}$ be the collection of
all subsets of $\mathbb{R}$ of the form $(b, \infty)$. The smallest $\sigma$-field containing these is the
Borel $\sigma$-field $\mathcal{B}$, so $f^{-1}(\mathcal{B}) \subseteq \mathcal{A}$ by Theorem A.34. □

There are versions of Lemma A.35 that apply to intervals of the form $(-\infty, a]$
and those of the form $(a, b)$, and so on. Similarly, there is a version for general
topological spaces.

Proposition A.36.$^{22}$ Let $(S, \mathcal{A})$ be a measurable space, and let $(T, \mathcal{C})$ be a
topological space with Borel $\sigma$-field. Then $f : S \to T$ is measurable if and only if
$f^{-1}(C) \in \mathcal{A}$ for all open $C$ (or for all closed $C$).

20 This theorem is used in the proofs of Lemma A.35, Proposition A.36,
Corollary A.37, Theorems A.38, B.75, and B.133, and to prove that stochastic processes
are measurable.

21 This lemma is used in the proofs of Theorems A.38 and A.74.
22 This proposition is used in the proof of Theorem A.38.






Another example of the use of Theorem A.34 is the proof that all continuous
functions are measurable. The result follows because the Borel $\sigma$-field is the
smallest $\sigma$-field containing open sets.

Corollary A.37. Let $(S, \mathcal{A})$ and $(T, \mathcal{B})$ be topological spaces with their Borel
$\sigma$-fields. If $f : S \to T$ is continuous, then $f$ is measurable.

Here are some properties of measurable functions that will prove useful. 

Theorem A.38. Let $(S, \mathcal{A})$ be a measurable space.

1. Let $\aleph$ be an index set, and let $\{(T_\alpha, \mathcal{C}_\alpha)\}_{\alpha\in\aleph}$ be a collection of measurable
spaces. For each $\alpha \in \aleph$, let $f_\alpha : S \to T_\alpha$ be a function. Define $f : S \to
\prod_{\alpha\in\aleph} T_\alpha$ by $f(s) = \{f_\alpha(s)\}_{\alpha\in\aleph}$. Then $f$ is measurable (with respect to the
product $\sigma$-field) if and only if each $f_\alpha$ is measurable.

2. If $(V, \mathcal{C}_1)$ and $(U, \mathcal{C}_2)$ are measurable spaces and $f : S \to V$ and $g : V \to U$
are measurable, then $g(f) : S \to U$ is measurable.

3. Let $f$ and $g$ be measurable functions from $S$ to $\mathbb{R}^n$, and let $a$ be a constant
scalar and let $b \in \mathbb{R}^n$ be constant. Then the following functions are also
measurable: $f + g$ and $af + b$. If $n = 1$, then $fg$ and $f/g$ are also measurable,
where $f/g$ can be set equal to an arbitrary constant when $g = 0$.

4. If, for each $n$, $f_n$ is a measurable, extended real-valued function, then
$\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$ are all measurable.

5. Let $(T, \mathcal{C})$ be a metric space with Borel $\sigma$-field. If $f_k : S \to T$ is a measurable
function for each $k = 1, 2, \ldots$ and $\lim_{k\to\infty} f_k(s) = f(s)$ for all $s$, then $f$ is
measurable.

6. Let $(T, \mathcal{C})$ be a metric space with Borel $\sigma$-field, and let $\mu$ be a measure on
$(S, \mathcal{A})$. If $f_k : S \to T$ is a measurable function for each $k = 1, 2, \ldots$ and
$\lim_{k\to\infty} f_k(s)$ exists a.e. $[\mu]$, then there is a measurable $f : S \to T$ such
that $\lim_{k\to\infty} f_k(s) = f(s)$, a.e. $[\mu]$.

PROOF. (1) Suppose that $f$ is measurable. To show that $f_\alpha$ is measurable, let
$B_\alpha \in \mathcal{C}_\alpha$, and let $B_\beta = T_\beta$ for $\beta \ne \alpha$. Set $C = \prod_{\beta\in\aleph} B_\beta$, which is in the
product $\sigma$-field because all but finitely many $B_\beta$ equal the entire space $T_\beta$. Then
$f_\alpha^{-1}(B_\alpha) = f^{-1}(C)$. Since $f$ is measurable, $f^{-1}(C) \in \mathcal{A}$. Now, suppose that each
$f_\alpha$ is measurable, and let $B = \prod_{\alpha\in\aleph} B_\alpha$, with $B_\alpha \in \mathcal{C}_\alpha$ for all $\alpha$ and all but finitely
many $B_\alpha$ (say $B_{\alpha_1}, \ldots, B_{\alpha_n}$) equal to $T_\alpha$. Then $f^{-1}(B) = \bigcap_{i=1}^n f_{\alpha_i}^{-1}(B_{\alpha_i}) \in \mathcal{A}$.
Since the sets of the form $B$ generate the product $\sigma$-field, $f^{-1}(B) \in \mathcal{A}$ for all $B$
in the product $\sigma$-field according to Theorem A.34.

(2) Let $A \in \mathcal{C}_2$. We need to prove that $g(f)^{-1}(A) \in \mathcal{A}$. First, note that
$g(f)^{-1}(A) = f^{-1}(g^{-1}(A))$. Since $g$ is measurable, $g^{-1}(A) \in \mathcal{C}_1$. Since $f$ is measurable,
$f^{-1}(g^{-1}(A)) \in \mathcal{A}$.

(3) The arithmetic parts of the theorem are all similar. They all follow from
parts 2 and 1. For example, $h(x, y) = x + y$ is a measurable function from $\mathbb{R}^2$
to $\mathbb{R}$, so $h(f, g) = f + g$ is measurable. For the quotient, a little more care is
needed. Let $h(x, y) = x/y$ when $y \ne 0$ and let it be an arbitrary constant when
$y = 0$. Then $h$ is measurable since $\{(x, y) : y = 0\}$ is in $\mathcal{B}^2$. It follows that $h(f, g)$
is measurable.

(4) Let $f = \sup_n f_n$. Then, for each finite $b$, $\{s : f(s) \le b\} = \bigcap_{n=1}^\infty \{s :
f_n(s) \le b\} \in \mathcal{A}$. Also $\{s : f(s) = -\infty\} = \bigcap_{n=1}^\infty \{s : f_n(s) = -\infty\} \in \mathcal{A}$ and
$\{s : f(s) = \infty\} = \bigcap_{i=1}^\infty \bigcup_{n=1}^\infty \{s : f_n(s) > i\} \in \mathcal{A}$. Similar arguments work for
inf. Since $\limsup_n f_n = \inf_k \sup_{n \ge k} f_n$ and $\liminf_n f_n = \sup_k \inf_{n \ge k} f_n$, these
are also measurable.

(5) Let $d$ be the metric in $T$. For each closed set $C \in \mathcal{C}$, and each $m$, let
$C_m = \{t : d(t, C) < 1/m\}$. For each closed $C$, define

$$A^*(C) = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty f_k^{-1}(C_m). \qquad \text{(A.39)}$$

It is easy to see that $A^*(C) \in \mathcal{A}$ is the set of all $s$ such that $\lim_{n\to\infty} f_n(s) \in
C$. Obviously, $f^{-1}(C)$ consists of those $s$ such that $\lim_{n\to\infty} f_n(s) \in C$. Hence,
$f^{-1}(C) = A^*(C) \in \mathcal{A}$, and Proposition A.36 says that $f$ is measurable.

(6) Let $G = \{s : \lim_{k\to\infty} f_k(s) \text{ does not exist}\}$, and let $G \subseteq C$ with $\mu(C) = 0$.
Let $t \in T$, and define $f(s) = t$ for $s \in C$ and $f(s) = \lim_{k\to\infty} f_k(s)$ for $s \in C^c$.
Apply part 5 to the restrictions of the functions $\{f_k\}_{k=1}^\infty$ to $C^c$ to conclude that
$f$ restricted to $C^c$ (call the restriction $g$) is measurable. If $A \in \mathcal{C}$, $f^{-1}(A) =
g^{-1}(A) \in \mathcal{A}$ if $t \notin A$ and $f^{-1}(A) = g^{-1}(A) \cup C \in \mathcal{A}$ if $t \in A$. So $f$ is measurable.
□

Part 6 is particularly useful in that it allows us to treat the limit of a sequence 
of measurable functions as a measurable function even if the limit only exists 
almost everywhere. This is only useful, however, if we can show that functions 
that are equal almost everywhere have similar properties. 

Many theorems about measurable functions are proven first for a special class of 
measurable functions called simple functions and then extended to all measurable 
functions using some limit theorems. 

Definition A.40. A measurable function $f$ is called simple if it assumes only
finitely many distinct values.

A simple function is often expressed in terms of its values. Let $f$ be a simple
function taking values in $\mathbb{R}^n$ for some $n$. Suppose that $\{a_1, \ldots, a_k\}$ are
the distinct values assumed by $f$, and let $A_i = f^{-1}(\{a_i\})$. Then
$f(s) = \sum_{i=1}^{k} a_i I_{A_i}(s)$.
The most fundamental limit theorem is the following.

Theorem A.41. If $f$ is a nonnegative measurable function, then there exists a
sequence of simple functions $\{f_i\}_{i=1}^{\infty}$ such that for all $s \in S$,
$f_i(s) \uparrow f(s)$.

Proof. For $k = 1, \ldots, i2^i$, let $A_{k,i} = \{s : (k-1)/2^i \le f(s) < k/2^i\}$. Define
$A_{0,i} = \{s : f(s) \ge i\}$. Then $A_{0,i}, A_{1,i}, \ldots, A_{i2^i,i}$ are disjoint and
their union is $S$. Define

\[
f_i(s) = \begin{cases} \dfrac{k-1}{2^i} & \text{if } s \in A_{k,i} \text{ for } k > 0, \\[1ex] i & \text{if } s \in A_{0,i}. \end{cases}
\]

It is clear that $f_i(s) \le f(s)$ for all $i$ and $s$, and each $f_i$ is a simple function.
Since, for $k > 0$, $A_{k,i} = A_{2k-1,i+1} \cup A_{2k,i+1}$, and
$A_{0,i} = A_{0,i+1} \cup A_{i2^{i+1}+1,i+1} \cup \cdots \cup A_{(i+1)2^{i+1},i+1}$, it is easy to
see that $f_i(s) \le f_{i+1}(s)$ for all $i$ and all $s$. It is also easy to see that, for
each $s$, there exists $n$ such that for $i \ge n$, $|f(s) - f_i(s)| \le 2^{-i}$.
Hence $f_i(s) \uparrow f(s)$. □
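
The construction in this proof can be tried out numerically. The following is a minimal Python sketch, not part of the original text: the function name `dyadic_approx` and the sample values are illustrative assumptions, and the function evaluates $f_i(s)$ directly from the value $f(s)$.

```python
import math

def dyadic_approx(value, i):
    """Return f_i(s) from the proof of Theorem A.41, given value = f(s) >= 0."""
    if value >= i:
        # s lies in A_{0,i} = {f >= i}
        return float(i)
    # s lies in A_{k,i} with (k-1)/2^i <= value < k/2^i, so f_i(s) = (k-1)/2^i
    return math.floor(value * 2**i) / 2**i

values = [0.0, 0.3, 1.0, 2.718, 7.5]   # hypothetical values of f(s)
levels = range(1, 12)
table = {v: [dyadic_approx(v, i) for i in levels] for v in values}

for v, seq in table.items():
    print(v, seq[:4], "...", seq[-1])
```

For each sample value the printed sequence is nondecreasing, stays below $f(s)$, and is within $2^{-i}$ of $f(s)$ once $i$ exceeds $f(s)$, matching the three claims in the proof.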

The following theorem will be very useful throughout the study of statistics. It
says that one function $g$ is a function of another $f$ if and only if $g$ is measurable
with respect to the $\sigma$-field generated by $f$.






Theorem A.42. Let $(S_1, \mathcal{A}_1)$, $(S_2, \mathcal{A}_2)$, and $(S_3, \mathcal{A}_3)$ be
measurable spaces such that $\mathcal{A}_3$ contains all singletons. Suppose that
$f : S_1 \to S_2$ is measurable. Let $\mathcal{A}_{1f}$ be the $\sigma$-field generated by $f$.
Let $T$ be the image of $f$ and let $\mathcal{A}_*$ be the image $\sigma$-field of $f$. Let
$g : S_1 \to S_3$ be a measurable function. Then $g$ is $\mathcal{A}_{1f}$ measurable
if and only if there is a measurable function $h : T \to S_3$ such that for each
$s \in S_1$, $g(s) = h(f(s))$.

PROOF. For the "if" part, assume that there is a measurable $h : T \to S_3$ such that
$g(s) = h(f(s))$ for all $s \in S_1$. Let $B \in \mathcal{A}_3$. We need to show that
$g^{-1}(B) \in \mathcal{A}_{1f}$. Since $h$ is measurable, $h^{-1}(B) \in \mathcal{A}_*$, so
$h^{-1}(B) = T \cap A$ for some $A \in \mathcal{A}_2$. Since $f^{-1}(A) = f^{-1}(T \cap A)$ and
$g^{-1}(B) = f^{-1}(h^{-1}(B))$, it follows that $g^{-1}(B) = f^{-1}(A) \in \mathcal{A}_{1f}$.

For the "only if" part, assume that $g$ is $\mathcal{A}_{1f}$ measurable. For each
$t \in S_3$, let $C_t = g^{-1}(\{t\})$. Since $g$ is measurable with respect to
$\mathcal{A}_{1f}$, let $A_t \in \mathcal{A}_2$ be such that $C_t = f^{-1}(A_t)$. (Such $A_t$
exists because of Corollary A.30.) Define $h(s) = t$ for all $s \in A_t \cap T$. (Note that
if $t_1 \ne t_2$, then $A_{t_1} \cap A_{t_2} \cap T = \emptyset$, so $h$ is well defined.) To see
that $g(s) = h(f(s))$, let $g(s) = t$, so that $s \in C_t = f^{-1}(A_t)$. This means that
$f(s) \in A_t \cap T$, which in turn implies $h(f(s)) = t = g(s)$.

To see that $h$ is measurable, let $A \in \mathcal{A}_3$. We must show that
$h^{-1}(A) \in \mathcal{A}_*$. Since $g$ is $\mathcal{A}_{1f}$ measurable,
$g^{-1}(A) \in \mathcal{A}_{1f}$, so there is some $B \in \mathcal{A}_2$ such that
$g^{-1}(A) = f^{-1}(B)$. We will show that $h^{-1}(A) = B \cap T \in \mathcal{A}_*$ to complete
the proof. If $s \in h^{-1}(A)$, then $t = h(s) \in A$ and $s = f(x)$ for some
$x \in C_t \subseteq g^{-1}(A) = f^{-1}(B)$, so $f(x) \in B$. Hence, $s \in B \cap T$. This implies
that $h^{-1}(A) \subseteq B \cap T$. Lastly, if $s \in B \cap T$, $s = f(x)$ for some
$x \in f^{-1}(B) = g^{-1}(A)$ and $h(s) = h(f(x)) = g(x) \in A$. So, $h(s) \in A$ and
$s \in h^{-1}(A)$. This implies $B \cap T \subseteq h^{-1}(A)$. □

The condition that A3 contain singletons is needed to avoid the situation in 
the following example. 

Example A.43. Let $S_1 = S_2 = S_3 = \mathbb{R}$ and let $\mathcal{A}_1 = \mathcal{A}_2$ be the
Borel $\sigma$-field, while $\mathcal{A}_3$ is the trivial $\sigma$-field. Then every function
$g : S_1 \to S_3$ is $\mathcal{A}_{1f}$ measurable no matter what $f$ is, for example,
$g(s) = s$. If $f(s) = s^2$, then $g$ is not a function of $f$.
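
On a finite set where every subset is measurable, Theorem A.42 reduces to the statement that $g$ factors through $f$ exactly when $g$ is constant on the level sets of $f$. The sketch below is an illustration under that simplifying assumption. The helper name `factor_through` and the sample functions are hypothetical and do not come from the text.

```python
def factor_through(f, g, domain):
    """Return h (a dict on the image of f) with g = h o f, or None if no such h exists."""
    h = {}
    for s in domain:
        t = f(s)
        if t in h and h[t] != g(s):
            # g separates two points that f identifies, so g is not a function of f
            return None
        h[t] = g(s)
    return h

domain = range(-3, 4)

# g(s) = |s| + 1 is constant on the level sets of f(s) = s^2, so h exists.
h_abs = factor_through(lambda s: s * s, lambda s: abs(s) + 1, domain)

# g(s) = s is not: f(-1) = f(1) but g(-1) != g(1), as in Example A.43.
h_id = factor_through(lambda s: s * s, lambda s: s, domain)

print(h_abs, h_id)
```

The second call mirrors Example A.43: with $f(s) = s^2$ and $g(s) = s$ no function $h$ with $g = h(f)$ can exist.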



A.4 Integration 

The integral of a function with respect to a measure is a way to generalize the 
notion of weighted average. We define the integral in stages. We start with non- 
negative simple functions. 

Definition A.44. Let $f$ be a nonnegative simple function represented as
$f(s) = \sum_{i=1}^{n} a_i I_{A_i}(s)$, with the $a_i$ distinct and the $A_i$ mutually disjoint.
Then, the integral of $f$ with respect to $\mu$ is
$\int f(s)\,d\mu(s) = \sum_{i=1}^{n} a_i \mu(A_i)$. If 0 times $\infty$ occurs
in such a sum, the result is 0 by convention.
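
As a concrete illustration of Definition A.44, the following Python sketch integrates a nonnegative simple function over a four-point space. The measure `mu`, the function values, and the helper `integrate_simple` are assumptions made for this example only. The second representation uses overlapping sets, and the sum $\sum_i a_i \mu(A_i)$ comes out the same.

```python
import math

def integrate_simple(terms, mu):
    """Sum a_i * mu(A_i) over (a_i, A_i) pairs, using the convention 0 * inf = 0."""
    total = 0.0
    for a, A in terms:
        mass = sum(mu[s] for s in A)
        total += 0.0 if a == 0 or mass == 0 else a * mass
    return total

# Hypothetical measure on S = {0, 1, 2, 3}; the point 3 carries infinite mass.
mu = {0: 0.5, 1: 1.5, 2: 2.0, 3: math.inf}

# f = 2 on {0, 1}, f = 5 on {2}, f = 0 on {3}: distinct values on disjoint sets.
canonical = [(2.0, {0, 1}), (5.0, {2}), (0.0, {3})]

# The same f written with overlapping sets: f = 2*I_{0,1,2} + 3*I_{2}.
overlapping = [(2.0, {0, 1, 2}), (3.0, {2})]

integral_canonical = integrate_simple(canonical, mu)
integral_overlapping = integrate_simple(overlapping, mu)
print(integral_canonical, integral_overlapping)
```

The term with value 0 on the set of infinite mass contributes 0, which is where the stated convention is needed.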

The integral of a nonnegative simple function is allowed to be $\infty$. It turns out
that the formula for the integral of a nonnegative simple function is more general
than in Definition A.44.






Proposition A.45.$^{23}$ If $(S, \mathcal{A}, \mu)$ is a measure space, $A_i \in \mathcal{A}$ and
$a_i \ge 0$ for $i = 1, \ldots, n$, and $f(s) = \sum_{i=1}^{n} a_i I_{A_i}(s)$, then
$\int f(s)\,d\mu(s) = \sum_{i=1}^{n} a_i \mu(A_i)$.

Next, we consider general nonnegative measurable functions. If $f$ is a nonnegative
simple function, then for every nonnegative simple function $g \le f$, it follows
easily from Definition A.44 that $\int g(s)\,d\mu(s) \le \int f(s)\,d\mu(s)$. Hence, the
following definition contains no contradiction with Definition A.44.

Definition A.46. If $f$ is a nonnegative measurable function, then the integral
of $f$ with respect to $\mu$ is
$\int f(s)\,d\mu(s) = \sup_{g \le f,\ g \text{ simple}} \int g(s)\,d\mu(s)$.

For general functions $f$, define the positive part as $f^+(s) = \max\{f(s), 0\}$ and
define the negative part as $f^-(s) = -\min\{f(s), 0\}$. Then $f(s) = f^+(s) - f^-(s)$. If
$f \ge 0$, then $f^- = 0$ and $\int f^-(s)\,d\mu(s) = 0$; hence the following definition
contains no contradiction with the previous definitions.

Definition A.47. If $f$ is a measurable function, then the integral of $f$ with respect
to $\mu$ is

\[
\int f(s)\,d\mu(s) = \int f^+(s)\,d\mu(s) - \int f^-(s)\,d\mu(s),
\]

if at least one of the two integrals on the right is finite. If both are infinite, the
integral is undefined. We say that $f$ is integrable if the integral of $f$ is defined
and is finite.

The integral is defined above in terms of its values at all points in S. Sometimes 
we wish to consider only a subset of S. 

Definition A.48. If $A \subseteq S$ and $f$ is measurable, the integral of $f$ over $A$ with
respect to $\mu$ is

\[
\int_A f(s)\,d\mu(s) = \int I_A(s) f(s)\,d\mu(s).
\]

Here are a few simple facts about integrals. 

Proposition A.49. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let
$f, g : S \to \mathbb{R}$ be measurable.

1. If $f = g$ a.e. [$\mu$], then $\int f(s)\,d\mu(s) = \int g(s)\,d\mu(s)$ if either integral
is defined.

2. If $\int f(s)\,d\mu(s)$ is defined and $a$ is a constant, then
$\int a f(s)\,d\mu(s) = a \int f(s)\,d\mu(s)$.

3. If $f$ and $g$ are integrable with respect to $\mu$, and $f \le g$, a.e. [$\mu$], then
$\int f(s)\,d\mu(s) \le \int g(s)\,d\mu(s)$.

4. If $f$ and $g$ are integrable and $\int_A f(s)\,d\mu(s) = \int_A g(s)\,d\mu(s)$ for all
$A \in \mathcal{A}$, then $f = g$, a.e. [$\mu$].



23 This proposition is used in the proof of Theorem A.53. 






The proofs of the next few theorems are essentially borrowed from Royden 
(1968). 

Theorem A.50 (Fatou's lemma).$^{24}$ Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of nonnegative
measurable functions. Then

\[
\int \liminf_{n \to \infty} f_n(s)\,d\mu(s) \le \liminf_{n \to \infty} \int f_n(s)\,d\mu(s).
\]

Proof. Let $f(s) = \liminf_{n \to \infty} f_n(s)$. Since

\[
\int f(s)\,d\mu(s) = \sup_{\text{simple } \phi \le f} \int \phi(s)\,d\mu(s),
\]

we need only prove that, for every simple $\phi \le f$,

\[
\int \phi(s)\,d\mu(s) \le \liminf_{n \to \infty} \int f_n(s)\,d\mu(s).
\]

Since this is clearly true if $\phi(s) = 0$, a.s. [$\mu$], we will assume that $\mu(A) > 0$,
where $A = \{s : \phi(s) > 0\}$. Let $\phi \le f$ be simple, let $\epsilon > 0$, and let $\delta$
and $M$ be the smallest and largest positive values that $\phi$ assumes. For each $n$, define

\[
A_n = \{s \in A : f_k(s) > (1 - \epsilon)\phi(s), \text{ for all } k \ge n\}.
\]

Since $(1 - \epsilon)\phi(s) < f(s)$ for all $s \in A$, $\bigcup_{n=1}^{\infty} A_n = A$ and
$A_n \subseteq A_{n+1}$ for all $n$. Let $B_n = A \cap A_n^c$. Then

\[
\int f_n(s)\,d\mu(s) \ge \int_{A_n} f_n(s)\,d\mu(s) \ge (1 - \epsilon) \int_{A_n} \phi(s)\,d\mu(s). \tag{A.51}
\]

If $\mu(B_n) = \infty$ for $n = n_0$, then $\mu(A) = \infty$ and $\int \phi(s)\,d\mu(s) = \infty$,
since $\phi$ takes on only finitely many different values. The rightmost integral in (A.51)
is at least $\delta\mu(A_n)$, which goes to $\infty$ as $n$ increases, hence
$\liminf_{n \to \infty} \int f_n(s)\,d\mu(s) = \infty$ and the result is true. So, assume
$\mu(B_n) < \infty$ for all $n$. Since $\bigcap_{n=1}^{\infty} B_n = \emptyset$, it follows
from Theorem A.19 that $\lim_{n \to \infty} \mu(B_n) = 0$. So, there exists $N$ such that
$n \ge N$ implies $\mu(B_n) < \epsilon$. Since

\[
\int \phi(s)\,d\mu(s) = \int_A \phi(s)\,d\mu(s) = \int_{A_n} \phi(s)\,d\mu(s) + \int_{B_n} \phi(s)\,d\mu(s)
\le \int_{A_n} \phi(s)\,d\mu(s) + M\epsilon,
\]

(A.51) implies that, for $n \ge N$,

\[
\int f_n(s)\,d\mu(s) \ge (1 - \epsilon) \int \phi(s)\,d\mu(s) - \epsilon(1 - \epsilon)M.
\]
24 This theorem is used in the proofs of Theorems A.52, A.57, A.60, B.117
and 7.oU.


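If $\int \phi(s)\,d\mu(s) = \infty$, the result is true again. If
$\int \phi(s)\,d\mu(s) = K < \infty$, then for every $n \ge N$,

\[
\int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s) - \epsilon[(1 - \epsilon)M + K],
\]

hence

\[
\liminf_{n \to \infty} \int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s) - \epsilon[(1 - \epsilon)M + K].
\]

Since this is true for every $\epsilon > 0$,

\[
\liminf_{n \to \infty} \int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s).
\]

□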
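
The inequality in Fatou's lemma can be strict. A standard example, not taken from the text, uses Lebesgue measure on $(0, 1)$ and $f_n = n I_{(0, 1/n)}$: every $f_n$ has integral 1, yet $f_n(x) \to 0$ at each fixed $x$. The sketch below checks both facts with exact rational arithmetic. The function names are illustrative, and the integral is computed from the closed form (height times interval length) rather than numerically.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n on the interval (0, 1/n) and 0 elsewhere."""
    return n if 0 < x < Fraction(1, n) else 0

def integral(n):
    """Lebesgue integral of f_n over (0, 1): height n times interval length 1/n."""
    return n * Fraction(1, n)

points = [Fraction(1, 3), Fraction(1, 10), Fraction(1, 1000)]

# f_n(x) = 0 as soon as n >= 1/x, so liminf_n f_n(x) = 0 at every x in (0, 1).
tail_values = {
    x: [f(n, x) for n in range(int(1 / x), int(1 / x) + 50)] for x in points
}
integrals = [integral(n) for n in range(1, 200)]

print(set(integrals), {x: set(v) for x, v in tail_values.items()})
```

Here the left side of Fatou's lemma is 0 and the right side is 1.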

Theorem A.52 (Monotone convergence theorem). Let $\{f_n\}_{n=1}^{\infty}$ be a
sequence of measurable nonnegative functions, and let $f$ be a measurable function
such that $f_n(x) \le f(x)$ a.e. [$\mu$] and $f_n(x) \to f(x)$ a.e. [$\mu$]. Then,

\[
\lim_{n \to \infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).
\]

Proof. Since $f_n \le f$ for all $n$, $\int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x)$ for all $n$.
Hence

\[
\liminf_{n \to \infty} \int f_n(x)\,d\mu(x) \le \limsup_{n \to \infty} \int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x).
\]

By Fatou's lemma A.50, $\int f(x)\,d\mu(x) \le \liminf_{n \to \infty} \int f_n(x)\,d\mu(x)$. □
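
A small numerical illustration of the monotone convergence theorem, under assumptions chosen for this example only: take counting measure on $\{1, 2, \ldots\}$, let $f(k) = 2^{-k}$, and let $f_n$ agree with $f$ for $k \le n$ and vanish otherwise. Then $f_n \le f$, $f_n \to f$ pointwise, and the integrals are partial sums of a geometric series.

```python
from fractions import Fraction

def integral_f_n(n):
    """Integral of f_n under counting measure: the sum of 2^{-k} for k = 1, ..., n."""
    return sum(Fraction(1, 2**k) for k in range(1, n + 1))

partial = [integral_f_n(n) for n in range(1, 30)]

# Integral of the limit f(k) = 2^{-k}: the full geometric series sums to 1.
limit = Fraction(1)
gaps = [limit - p for p in partial]

print(partial[:4], gaps[-1])
```

The integrals increase to the integral of the limit, and the gap after $n$ terms is exactly $2^{-n}$.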

Theorem A.53. If $\int f(s)\,d\mu(s)$ and $\int g(s)\,d\mu(s)$ are defined and they are not
both infinite and of opposite signs, then
$\int [f(s) + g(s)]\,d\mu(s) = \int f(s)\,d\mu(s) + \int g(s)\,d\mu(s)$.

PROOF. If $f, g \ge 0$, then by Theorem A.41, there exist sequences of nonnegative
simple functions $\{f_n\}_{n=1}^{\infty}$ and $\{g_n\}_{n=1}^{\infty}$ such that
$f_n \uparrow f$ and $g_n \uparrow g$. Then $(f_n + g_n) \uparrow (f + g)$ and
$\int [f_n(s) + g_n(s)]\,d\mu(s) = \int f_n(s)\,d\mu(s) + \int g_n(s)\,d\mu(s)$ by
Proposition A.45. The result now follows from the monotone convergence theorem A.52.
For integrable $f$ and $g$, note that $(f + g)^+ + f^- + g^- = (f + g)^- + f^+ + g^+$.
What we just proved for nonnegative functions implies that

\[
\begin{aligned}
\int (f + g)^+(s)\,d\mu(s) + \int f^-(s)\,d\mu(s) + \int g^-(s)\,d\mu(s)
&= \int [(f + g)^+(s) + f^-(s) + g^-(s)]\,d\mu(s) \\
&= \int [(f + g)^-(s) + f^+(s) + g^+(s)]\,d\mu(s) \\
&= \int (f + g)^-(s)\,d\mu(s) + \int f^+(s)\,d\mu(s) + \int g^+(s)\,d\mu(s).
\end{aligned}
\]

Rearranging the terms in the first and last expressions gives the desired result. If
both $f$ and $g$ have infinite integral of the same sign, then it follows easily using
Proposition A.49, that $f + g$ has infinite integral of the same sign. Finally, if only
one of $f$ and $g$ has infinite integral, it also follows easily from Proposition A.49
that $f + g$ has infinite integral of the same sign. □

A nonnegative function can be used to create a new measure.

Theorem A.54. Let $(S, \mathcal{A}, \mu)$ be a measure space, and let $f : S \to \mathbb{R}$ be
nonnegative and measurable. Then $\nu(A) = \int_A f(s)\,d\mu(s)$ is a measure on
$(S, \mathcal{A})$.

PROOF. Clearly, $\nu$ is nonnegative and $\nu(\emptyset) = 0$, since $f(s) I_\emptyset(s) = 0$,
a.e. [$\mu$]. Let $\{A_n\}_{n=1}^{\infty}$ be disjoint. For each $n$, define
$g_n(s) = f(s) I_{A_n}(s)$ and $f_n(s) = \sum_{i=1}^{n} g_i(s)$. Define
$A = \bigcup_{n=1}^{\infty} A_n$. Then $0 \le f_n \le f I_A$, a.e. [$\mu$] and $f_n$ converges
to $f I_A$, a.e. [$\mu$]. So, the monotone convergence theorem A.52 says that

\[
\lim_{n \to \infty} \int f_n(s)\,d\mu(s) = \nu(A). \tag{A.55}
\]

Also, $\nu(A_i) = \int g_i(s)\,d\mu(s)$, for each $i$. It follows from Theorem A.53 that

\[
\nu\left(\bigcup_{i=1}^{n} A_i\right) = \int f_n(s)\,d\mu(s) = \sum_{i=1}^{n} \int g_i(s)\,d\mu(s) = \sum_{i=1}^{n} \nu(A_i). \tag{A.56}
\]

Take the limit as $n \to \infty$ of the second and last terms in (A.56) and compare to
(A.55) to see that $\nu$ is countably additive. □
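
On a finite space the integral in Theorem A.54 is a finite sum, so the additivity of $\nu$ can be checked exhaustively. The following sketch does this for a hypothetical three-point space. The measure `mu` and the nonnegative function `f` are made-up values for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical base measure and nonnegative function on S = {a, b, c}.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
f = {"a": Fraction(0), "b": Fraction(3), "c": Fraction(6)}

def nu(A):
    """nu(A) = integral over A of f with respect to mu (a finite sum here)."""
    return sum(f[s] * mu[s] for s in A)

points = list(mu)
subsets = [frozenset(c) for r in range(len(points) + 1) for c in combinations(points, r)]

# Check nu(A u B) = nu(A) + nu(B) for every pair of disjoint subsets.
additive = all(
    nu(A | B) == nu(A) + nu(B) for A in subsets for B in subsets if not (A & B)
)
print(additive, nu(frozenset()), nu(frozenset(points)))
```

Finite additivity is all that can be tested on a finite space. The countable case is exactly what the monotone convergence step in the proof supplies.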

Theorem A.57 (Dominated convergence theorem). Let $\{f_n\}_{n=1}^{\infty}$ be a
sequence of measurable functions, and let $f$ and $g$ be measurable functions such that
$f_n(x) \to f(x)$ a.e. [$\mu$], $|f_n(x)| \le g(x)$ a.e. [$\mu$], and
$\int g(x)\,d\mu(x) < \infty$. Then,

\[
\lim_{n \to \infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).
\]

Proof. We have $-g(x) \le f_n(x) \le g(x)$ a.e. [$\mu$], hence

\[
\begin{aligned}
g(x) + f_n(x) &\ge 0, \text{ a.e. } [\mu], \\
g(x) - f_n(x) &\ge 0, \text{ a.e. } [\mu], \\
\lim_{n \to \infty} [g(x) + f_n(x)] &= g(x) + f(x), \text{ a.e. } [\mu], \\
\lim_{n \to \infty} [g(x) - f_n(x)] &= g(x) - f(x), \text{ a.e. } [\mu].
\end{aligned}
\]

It follows from Fatou's lemma A.50 and Theorem A.53 that

\[
\begin{aligned}
\int [g(x) + f(x)]\,d\mu(x) &\le \liminf_{n \to \infty} \int [g(x) + f_n(x)]\,d\mu(x) \\
&= \int g(x)\,d\mu(x) + \liminf_{n \to \infty} \int f_n(x)\,d\mu(x), \\
\int f(x)\,d\mu(x) &\le \liminf_{n \to \infty} \int f_n(x)\,d\mu(x).
\end{aligned}
\]

Similarly, it follows that

\[
\begin{aligned}
\int [g(x) - f(x)]\,d\mu(x) &\le \liminf_{n \to \infty} \int [g(x) - f_n(x)]\,d\mu(x) \\
&= \int g(x)\,d\mu(x) - \limsup_{n \to \infty} \int f_n(x)\,d\mu(x), \\
\int f(x)\,d\mu(x) &\ge \limsup_{n \to \infty} \int f_n(x)\,d\mu(x).
\end{aligned}
\]

Together, these imply the conclusion of the theorem. □

An alternate version of the dominated convergence theorem is the following.

Proposition A.58.$^{25}$ Let $\{f_n\}_{n=1}^{\infty}$, $\{g_n\}_{n=1}^{\infty}$ be sequences of
measurable functions such that $|f_n(x)| \le g_n(x)$, a.e. [$\mu$]. Let $f$ and $g$ be
measurable functions such that $\lim_{n \to \infty} f_n(x) = f(x)$ and
$\lim_{n \to \infty} g_n(x) = g(x)$, a.e. [$\mu$]. Suppose that
$\lim_{n \to \infty} \int g_n(x)\,d\mu(x) = \int g(x)\,d\mu(x) < \infty$. Then,
$\lim_{n \to \infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x)$.

The proof is the same as the proof of Theorem A.57, except that $g_n$ replaces $g$ in
the first three lines and wherever $g$ appears with $f_n$ and a limit is being taken.

For $\sigma$-finite measure spaces, the minimal condition that guarantees convergence
of integrals is uniform integrability.

Definition A.59. A sequence of integrable functions $\{f_n\}_{n=1}^{\infty}$ is uniformly
integrable (with respect to $\mu$) if
$\lim_{c \to \infty} \sup_n \int_{\{x : |f_n(x)| > c\}} |f_n(x)|\,d\mu(x) = 0$.

Theorem A.60.$^{26}$ Let $\mu$ be a finite measure. Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of
integrable functions such that $\lim_{n \to \infty} f_n = f$ a.e. [$\mu$]. Then
$\lim_{n \to \infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x)$ if $\{f_n\}_{n=1}^{\infty}$ is
uniformly integrable.$^{27}$

PROOF. Let $f_n^+$, $f_n^-$, $f^+$, and $f^-$ be the positive and negative parts of $f_n$ and
$f$. We will prove that the result holds for nonnegative functions and take the
difference to get the general result. Let $\epsilon > 0$ and let $c$ be large enough so that
$\sup_n \int_{\{x : f_n(x) > c\}} f_n(x)\,d\mu(x) < \epsilon$. The functions

\[
g_n(x) = \begin{cases} f_n(x) & \text{if } f_n(x) \le c, \\ c & \text{if } f_n(x) > c \end{cases}
\]

converge a.e. [$\mu$] to

\[
g(x) = \begin{cases} f(x) & \text{if } f(x) \le c, \\ c & \text{if } f(x) > c. \end{cases}
\]

We now have

\[
\begin{aligned}
\int f(x)\,d\mu(x) &\ge \int g(x)\,d\mu(x) \\
&= \lim_{n \to \infty} \int g_n(x)\,d\mu(x) \\
&\ge \limsup_{n \to \infty} \int f_n(x)\,d\mu(x) - \epsilon,
\end{aligned}
\]


25 This proposition is used in the proof of Scheffe's theorem B.79. 
26 This theorem is used in the proofs of Theorems 1.121 and B.118. 
27 One could replace "if" by "if and only if," but we will never need the "only
if" part of the theorem in this book.




where the second line follows from the dominated convergence theorem A.57 and
the third from our choice of $c$. Since this is true for every $\epsilon$, we have
$\int f(x)\,d\mu(x) \ge \limsup_{n \to \infty} \int f_n(x)\,d\mu(x)$. Combining this with
Fatou's lemma A.50 gives

\[
\int f(x)\,d\mu(x) = \lim_{n \to \infty} \int f_n(x)\,d\mu(x).
\]

□



A.5 Product Spaces

In Definition A.12, we introduced product spaces and product $\sigma$-fields. We would
like to be able to define measures on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ in
terms of measures on $(S_1, \mathcal{A}_1)$ and $(S_2, \mathcal{A}_2)$. The derivation of
product measure given here resembles the derivation in Billingsley (1986, Section 18).

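Lemma A.61.$^{28}$ Let $(S_1, \mathcal{A}_1, \mu_1)$ and $(S_2, \mathcal{A}_2, \mu_2)$ be
$\sigma$-finite measure spaces, and let $\mathcal{A}_1 \otimes \mathcal{A}_2$ be the product
$\sigma$-field.

• For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $x \in S_1$,
$B_x = \{y : (x, y) \in B\} \in \mathcal{A}_2$ and $\mu_2(B_x)$ is a measurable function from
$(S_1, \mathcal{A}_1)$ to $\mathbb{R} \cup \{\infty\}$.

• For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $y \in S_2$,
$B^y = \{x : (x, y) \in B\} \in \mathcal{A}_1$ and $\mu_1(B^y)$ is a measurable function from
$(S_2, \mathcal{A}_2)$ to $\mathbb{R} \cup \{\infty\}$.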

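Proof. Clearly, we need only prove one of the two sets of assertions. First, let
$B = A_1 \times A_2$ with $A_i \in \mathcal{A}_i$ for $i = 1, 2$ and $x \in S_1$. Then

\[
B_x = \begin{cases} A_2 & \text{if } x \in A_1, \\ \emptyset & \text{otherwise.} \end{cases}
\]

So, $B_x \in \mathcal{A}_2$. Let $\mathcal{C}$ be the collection of all sets
$B \subseteq S_1 \times S_2$ such that $B_x \in \mathcal{A}_2$. If $B \in \mathcal{C}$, then
$(B^c)_x = \{y : (x, y) \notin B\} = (B_x)^c$, so $B^c \in \mathcal{C}$. Let
$\{B_n\}_{n=1}^{\infty} \in \mathcal{C}$. Then it is easy to see that

\[
\left(\bigcup_{n=1}^{\infty} B_n\right)_x = \left\{y : (x, y) \in \bigcup_{n=1}^{\infty} B_n\right\}
= \bigcup_{n=1}^{\infty} \{y : (x, y) \in B_n\} = \bigcup_{n=1}^{\infty} (B_n)_x \in \mathcal{A}_2. \tag{A.62}
\]

Clearly, $S_1 \times S_2 \in \mathcal{C}$, so $\mathcal{C}$ is a $\sigma$-field containing all
product sets; hence it contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. Next, let
$f_B(x) = \mu_2(B_x)$ for $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$. Write
$S_1 \times S_2 = \bigcup_{n=1}^{\infty} E_n$ with $E_n = A_{1n} \times A_{2n}$ and
$\mu_i(A_{in}) < \infty$ for all $n$ and $i = 1, 2$ and with the $E_n$ disjoint.
Then let $f_{B,n}(x) = \mu_2((B \cap E_n)_x)$. It follows that
$f_B = \sum_{n=1}^{\infty} f_{B,n}$. If we can show that $f_{B,n}$ is measurable for each $n$,
then so is $f_B$, since they are nonnegative, and the sum is well defined. If
$B = B_1 \times B_2$, then $f_{B,n}(x) = I_{A_{1n} \cap B_1}(x)\mu_2(A_{2n} \cap B_2)$,
which is a measurable function. Let $\mathcal{D}$ be the collection of all sets
$D \subseteq S_1 \times S_2$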


28 This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems A.69
and B.46.




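such that $f_{D,n}$ is measurable. If $D \in \mathcal{D}$, then
$f_{D^c,n} = \mu_2(A_{2n}) I_{A_{1n}} - f_{D,n}$, which is measurable, so
$D^c \in \mathcal{D}$. If $\{D_m\}_{m=1}^{\infty} \in \mathcal{D}$ with the $D_m$ disjoint, then

\[
f_{\cup_m D_m,\,n}(x) = \mu_2\left(\left(\bigcup_{m=1}^{\infty} (D_m \cap E_n)\right)_x\right)
= \sum_{m=1}^{\infty} \mu_2((D_m \cap E_n)_x) = \sum_{m=1}^{\infty} f_{D_m,n}(x),
\]

which is a measurable function, so $\bigcup_{m=1}^{\infty} D_m \in \mathcal{D}$. Clearly,
$S_1 \times S_2 \in \mathcal{D}$, so $\mathcal{D}$ is a $\lambda$-system (see Definition A.14)
that contains the $\pi$-system of product sets. By the $\pi$-$\lambda$ theorem A.17,
$\mathcal{D}$ contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. □

The following corollary to Lemma A.64 is a sort of dual to part 1 of Theorem A.38.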

Corollary A.63. Let $(S_1, \mathcal{A}_1)$, $(S_2, \mathcal{A}_2)$, and $(X, \mathcal{B})$ be
measurable spaces. If $f : S_1 \times S_2 \to X$ is measurable, then for every
$s_1 \in S_1$, $f_{s_1}(s_2) = f(s_1, s_2)$ is a measurable function from $S_2$ to $X$.

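Lemma A.64.$^{29}$ Suppose that $(S_1, \mathcal{A}_1, \mu_1)$ and $(S_2, \mathcal{A}_2, \mu_2)$
are $\sigma$-finite measure spaces. For each $x \in S_1$, $y \in S_2$, and
$B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, define $B_x$ and $B^y$ as in Lemma A.61. Then
$\nu_1(B) = \int_{S_1} \mu_2(B_x)\,d\mu_1(x)$ and $\nu_2(B) = \int_{S_2} \mu_1(B^y)\,d\mu_2(y)$
both define the same measure on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. If
$A_i \in \mathcal{A}_i$ for $i = 1, 2$, then $\nu_1(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.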

Proof. First, prove that $\nu_1$ is a measure. The proof that $\nu_2$ is a measure is
identical. Clearly, $\nu_1(B) \ge 0$ for all $B$ and $\nu_1(\emptyset) = 0$. If
$\{B_n\}_{n=1}^{\infty}$ are disjoint, then

\[
\nu_1\left(\bigcup_{n=1}^{\infty} B_n\right) = \int \sum_{n=1}^{\infty} \mu_2((B_n)_x)\,d\mu_1(x)
= \sum_{n=1}^{\infty} \int \mu_2((B_n)_x)\,d\mu_1(x) = \sum_{n=1}^{\infty} \nu_1(B_n),
\]

where the first equality follows from the definition of $\nu_1$, the fact that $\mu_2$ is
countably additive, and (A.62); the second equality follows from the monotone
convergence theorem A.52 and the fact that
$\sum_{n=1}^{m} \mu_2((B_n)_x) \le \sum_{n=1}^{\infty} \mu_2((B_n)_x)$ for all $m$; and the last
equality follows from the definition of $\nu_1$. This proves that $\nu_1$ (and so too
$\nu_2$) is a measure. Note that if $B = A_1 \times A_2$, then

\[
\begin{aligned}
\nu_1(B) &= \int_{S_1} I_{A_1}(x)\mu_2(A_2)\,d\mu_1(x) = \mu_1(A_1)\mu_2(A_2) \\
&= \int_{S_2} I_{A_2}(y)\mu_1(A_1)\,d\mu_2(y) = \nu_2(B).
\end{aligned}
\]

So, $\nu_1 = \nu_2$ on the $\pi$-system consisting of product sets. Since each of $\mu_1$ and
$\mu_2$ is $\sigma$-finite, there exists a countable collection of product sets whose union
is $S_1 \times S_2$



29 This lemma is used in the proof of Lemma A.67. 






and such that each one has finite $\nu_1 = \nu_2$ measure. By Theorem A.26, $\nu_1$ agrees
with $\nu_2$ on all of $\mathcal{A}_1 \otimes \mathcal{A}_2$. □

Definition A.65. Let $(S_i, \mathcal{A}_i, \mu_i)$ for $i = 1, 2$ be $\sigma$-finite measure
spaces. Define the product measure $\mu_1 \times \mu_2$ on
$(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ as the common value of the
two measures $\nu_1$ and $\nu_2$ in Lemma A.64.

Lebesgue measure on $\mathbb{R}^2$, denoted $dx\,dy$, is a product measure. Not every
measure on a product space is a product measure. Product probability measures
will correspond to independent random variables. (See Theorem B.66.)
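
For finite spaces, the two constructions $\nu_1$ and $\nu_2$ of Lemma A.64 are finite sums over sections, and their agreement can be checked directly. The sketch below is an illustration with made-up point masses. The names `mu1`, `mu2`, `nu1`, and `nu2` are chosen to mirror the notation of the lemma and are not from any library.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite measures, given by their point masses.
mu1 = {"x1": Fraction(1, 4), "x2": Fraction(3, 4)}
mu2 = {"y1": Fraction(2), "y2": Fraction(1, 2), "y3": Fraction(5)}

def nu1(B):
    """Integrate mu2 of the x-section B_x against mu1."""
    return sum(mu1[x] * sum(mu2[y] for y in mu2 if (x, y) in B) for x in mu1)

def nu2(B):
    """Integrate mu1 of the y-section B^y against mu2."""
    return sum(mu2[y] * sum(mu1[x] for x in mu1 if (x, y) in B) for y in mu2)

B = {("x1", "y1"), ("x2", "y1"), ("x2", "y3")}   # a set that is not a rectangle
rectangle = set(product({"x2"}, {"y1", "y2"}))   # the product set {x2} x {y1, y2}

print(nu1(B), nu2(B), nu1(rectangle))
```

Both orders give the same value on the non-rectangular set, and on the product set the value is $\mu_1(\{x_2\})\mu_2(\{y_1, y_2\})$.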

Proposition A.66. Let $\mu$ be a measure on a product space
$(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. Then $\mu$ is a product measure if and
only if there exist set functions $\mu_i : \mathcal{A}_i \to \mathbb{R}$ for $i = 1, 2$ such
that, for every $A_1 \in \mathcal{A}_1$ and $A_2 \in \mathcal{A}_2$,
$\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.

Lemma A.67.$^{30}$ Let $f$ be a measurable function from $S_1 \times S_2$ to $\mathbb{R}$ such
that either $\{x \in S_1 : \int |f(x, y)|\,d\mu_2(y) = \infty\} \subseteq A \in \mathcal{A}_1$,
where $\mu_1(A) = 0$, or $f \ge 0$. Then, there is a measurable (possibly extended
real-valued) function $g : S_1 \to \mathbb{R} \cup \{\pm\infty\}$ such that
$g(x) = \int f(x, y)\,d\mu_2(y)$, a.e. [$\mu_1$]. If $f$ is the indicator of a
measurable set $B$, then

\[
\int g(x)\,d\mu_1(x) = \mu_1 \times \mu_2(B). \tag{A.68}
\]

Proof. For each $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, note that
$\int I_B(x, y)\,d\mu_2(y) = \mu_2(B_x)$, where $B_x$ is defined in Lemma A.61. It was shown
there that $\mu_2(B_x)$ is a measurable function of $x$. It follows from Lemma A.64 that
(A.68) holds. It now follows from the linearity of integrals that if $f$ is a
nonnegative simple function, then $g(x) = \int f(x, y)\,d\mu_2(y)$ is a measurable
function of $x$. If $f$ is a nonnegative measurable function, let $\{f_n\}_{n=1}^{\infty}$ be
a sequence of nonnegative simple functions such that $f_n \le f$ for all $n$ and
$\lim_{n \to \infty} f_n(x, y) = f(x, y)$ for all $(x, y)$. Then, the monotone convergence
theorem A.52 says that
$\lim_{n \to \infty} \int f_n(x, y)\,d\mu_2(y) = \int f(x, y)\,d\mu_2(y) = g(x)$ for all $x$.
By part 5 of Theorem A.38, $g$ is measurable. If
$\{x \in S_1 : \int |f(x, y)|\,d\mu_2(y) = \infty\} \subseteq A$, then the argument just given
applies to both $f^+$ and $f^-$ and the difference
$\int f^+(x, y)\,d\mu_2(y) - \int f^-(x, y)\,d\mu_2(y)$ is defined a.e. [$\mu_1$] and equals
$\int f(x, y)\,d\mu_2(y)$, a.e. [$\mu_1$]. If we let
$g(x) = \int f^+(x, y)\,d\mu_2(y) - \int f^-(x, y)\,d\mu_2(y)$ for all $x \notin A$, and let
$g(x)$ be constant on $A$, then $g(x) = \int f(x, y)\,d\mu_2(y)$, a.e. [$\mu_1$], and $g$ is
measurable. □

The following two theorems will be used many times in the study of product
spaces.

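Theorem A.69 (Tonelli's theorem). Let $(S_1, \mathcal{A}_1, \mu_1)$ and
$(S_2, \mathcal{A}_2, \mu_2)$ be $\sigma$-finite measure spaces. Let
$f : S_1 \times S_2 \to \mathbb{R}$ be a nonnegative measurable function. Then

\[
\int f(x, y)\,d\mu_1 \times \mu_2(x, y) = \int \left[\int f(x, y)\,d\mu_1(x)\right] d\mu_2(y)
= \int \left[\int f(x, y)\,d\mu_2(y)\right] d\mu_1(x).
\]

30 This lemma is used in the proofs of Theorem A.70 and of Lemmas 6.48 and B.46.

Proof. As in the proof of Lemma A.67, let $\{f_n\}_{n=1}^{\infty}$ be a sequence of
nonnegative simple functions such that $f_n \le f$ for all $n$ and
$\lim_{n \to \infty} f_n(x, y) = f(x, y)$ for all $(x, y)$. If
$f_n(x, y) = \sum_{i=1}^{k_n} a_{i,n} I_{B_{i,n}}(x, y)$, then
$\int f_n(x, y)\,d\mu_2(y) = \sum_{i=1}^{k_n} a_{i,n} \mu_2((B_{i,n})_x)$ by Lemma A.61 and

\[
\int \left[\int f_n(x, y)\,d\mu_2(y)\right] d\mu_1(x) = \int f_n(x, y)\,d\mu_1 \times \mu_2(x, y)
\]

by (A.68). Since $0 \le \int f_n(x, y)\,d\mu_2(y) \le \int f(x, y)\,d\mu_2(y)$ for all $x$ and
$n$, and $\lim_{n \to \infty} \int f_n(x, y)\,d\mu_2(y) = \int f(x, y)\,d\mu_2(y)$ as in the
proof of Lemma A.67, it follows from the monotone convergence theorem A.52 that

\[
\begin{aligned}
\int f(x, y)\,d\mu_1 \times \mu_2(x, y) &= \lim_{n \to \infty} \int f_n(x, y)\,d\mu_1 \times \mu_2(x, y) \\
&= \lim_{n \to \infty} \int \left[\int f_n(x, y)\,d\mu_2(y)\right] d\mu_1(x) \\
&= \int \left[\lim_{n \to \infty} \int f_n(x, y)\,d\mu_2(y)\right] d\mu_1(x) \\
&= \int \left[\int f(x, y)\,d\mu_2(y)\right] d\mu_1(x).
\end{aligned}
\]

The proof that the iterated integrals can be calculated in the other order is
similar. □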

Theorem A.70 (Fubini's theorem). Let $(S_1, \mathcal{A}_1, \mu_1)$ and
$(S_2, \mathcal{A}_2, \mu_2)$ be $\sigma$-finite measure spaces. If
$f : S_1 \times S_2 \to \mathbb{R}$ is integrable with respect to $\mu_1 \times \mu_2$, then

\[
\int f(x, y)\,d\mu_1 \times \mu_2(x, y) = \int \left[\int f(x, y)\,d\mu_1(x)\right] d\mu_2(y)
= \int \left[\int f(x, y)\,d\mu_2(y)\right] d\mu_1(x).
\]

Proof. Let $g(x) = \int |f(x, y)|\,d\mu_2(y)$, a.e. [$\mu_1$], be measurable. Then

\[
\int g(x)\,d\mu_1(x) = \int \left[\int |f(x, y)|\,d\mu_2(y)\right] d\mu_1(x)
= \int |f(x, y)|\,d\mu_1 \times \mu_2(x, y) < \infty
\]

follows from Tonelli's theorem A.69 applied to $|f|$. It follows that

\[
\left\{x : \int |f(x, y)|\,d\mu_2(y) = \infty\right\} \subseteq A
\]

for some $A \in \mathcal{A}_1$ with $\mu_1(A) = 0$. Apply Tonelli's theorem A.69 to $f^+$ and
$f^-$ and note that the set of all $x$ such that
$\int f^+(x, y)\,d\mu_2(y) - \int f^-(x, y)\,d\mu_2(y)$ is undefined is a
subset of $\{x : \int |f(x, y)|\,d\mu_2(y) = \infty\}$. It follows that this difference of
integrals is defined a.e. [$\mu_1$] and the integral (with respect to $\mu_1$) of the
difference (which equals $\int [\int f(x, y)\,d\mu_2(y)]\,d\mu_1(x)$) is the difference of the
integrals (which equals $\int f(x, y)\,d\mu_1 \times \mu_2(x, y)$). □

All of the results of this section can be extended to finite product spaces
$S_1 \times \cdots \times S_n$ by simple inductive arguments.
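
The integrability hypothesis in Fubini's theorem cannot be dropped. A classical counterexample, not taken from the text, uses counting measure on the nonnegative integers in each coordinate and the array $a(m, n)$ equal to $1$ when $n = m$, $-1$ when $n = m + 1$, and $0$ otherwise. The sum of $|a(m, n)|$ is infinite, and the two iterated sums differ. The sketch below computes each inner sum exactly over its finite support. The function names are illustrative.

```python
def a(m, n):
    """a(m, n) = 1 on the diagonal, -1 just to the right of it, 0 elsewhere (m, n >= 0)."""
    if n == m:
        return 1
    if n == m + 1:
        return -1
    return 0

def row_sum(m):
    """Sum over n of a(m, n); only n = m and n = m + 1 contribute."""
    return sum(a(m, n) for n in (m, m + 1))

def col_sum(n):
    """Sum over m of a(m, n); only m = n - 1 and m = n (with m >= 0) contribute."""
    return sum(a(m, n) for m in (n - 1, n) if m >= 0)

M = 1000
sum_rows_first = sum(row_sum(m) for m in range(M))   # inner sum over n, outer over m
sum_cols_first = sum(col_sum(n) for n in range(M))   # inner sum over m, outer over n

print(sum_rows_first, sum_cols_first)
```

Every row sums to 0, while the column $n = 0$ sums to 1 and all other columns sum to 0, so the outer sums are 0 and 1 for every truncation level `M` and therefore also in the limit.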






A.6 Absolute Continuity

It is also common to consider two different measures on the same space. 

Definition A.71. Let $\mu_1$ and $\mu_2$ be two measures on the same space
$(S, \mathcal{A})$. Suppose that, for all $A \in \mathcal{A}$, $\mu_1(A) = 0$ implies
$\mu_2(A) = 0$. Then, we say that $\mu_2$ is absolutely continuous with respect to $\mu_1$,
denoted $\mu_2 \ll \mu_1$. When $\mu_2 \ll \mu_1$, we say that $\mu_1$ is a dominating measure
for $\mu_2$.

Consider next a function $f$ and a measure $\mu$ such that $\int f(x)\,d\mu(x)$ is defined.
Then $\nu(A) = \int_A f(x)\,d\mu(x)$ is defined for all measurable $A$. If $f$ takes on
negative values with positive measure, then $\nu$ is not a measure because it assigns
negative values to some sets, such as $A = \{x : f(x) < 0\}$. However, $\nu$ is still a
signed measure.

If one of a pair of two measures is finite, there is a necessary and sufficient 
condition for absolute continuity which resembles the definition of continuity of 
functions. 

Lemma A.72.$^{31}$ Let $\mu_1$ and $\mu_2$ be measures on a space $(S, \mathcal{A})$. Consider
the following condition:

For every $\epsilon > 0$, there is $\delta_\epsilon$ such that $\mu_1(A) < \delta_\epsilon$
implies $\mu_2(A) < \epsilon$. (A.73)

• If condition (A.73) holds, then $\mu_2 \ll \mu_1$.

• If $\mu_2 \ll \mu_1$ and $\mu_2$ is finite, then condition (A.73) holds.

Proof. For the first part, let $\epsilon > 0$ and suppose that $\mu_1(A) = 0$. Then
$\mu_1(A) < \delta_\epsilon$ and $\mu_2(A) < \epsilon$. Since this is true for all
$\epsilon > 0$, $\mu_2(A) = 0$. For the second part, suppose that $\mu_2 \ll \mu_1$, that
$\mu_2$ is finite, and that (A.73) fails. Then there exists $\epsilon > 0$ such that, for
every integer $n$, there is $A_n$ with $\mu_1(A_n) < 1/n^2$ but $\mu_2(A_n) \ge \epsilon$. Let
$A = \bigcap_{k=1}^{\infty} \bigcup_{n=k}^{\infty} A_n$. By the first Borel-Cantelli lemma
A.20, $\mu_1(A) = 0$ so $\mu_2(A) = 0$. Since $\mu_2$ is finite, Theorem A.19 implies that

\[
\mu_2(A) = \lim_{k \to \infty} \mu_2\left(\bigcup_{n=k}^{\infty} A_n\right) \ge \epsilon.
\]

This is a contradiction. □

The following theorem says that the first part of Example A.8 on page 574 is
the most general form of absolute continuity with respect to $\sigma$-finite measures.
The proof is mostly borrowed from Royden (1968).

Theorem A.74 (Radon-Nikodym theorem). Let $\mu_1$ and $\mu_2$ be measures on
$(S, \mathcal{A})$ such that $\mu_2 \ll \mu_1$ and $\mu_1$ is $\sigma$-finite. Then there exists
an extended real-valued measurable function $f : S \to [0, \infty]$ such that for every
$A \in \mathcal{A}$,

\[
\mu_2(A) = \int_A f(x)\,d\mu_1(x). \tag{A.75}
\]



31 This lemma is used in the proof of Lemma B.119. 






Also, if $g : S \to \mathbb{R}$ is $\mu_2$ integrable, then

\[
\int g(x)\,d\mu_2(x) = \int g(x) f(x)\,d\mu_1(x). \tag{A.76}
\]

The function $f$ is called the Radon-Nikodym derivative of $\mu_2$ with respect to $\mu_1$
and it is unique a.e. [$\mu_1$]. The Radon-Nikodym derivative is sometimes denoted
$(d\mu_2/d\mu_1)(s)$. If $\mu_2$ is $\sigma$-finite, then $f$ is finite a.e. [$\mu_1$].

Proof. First, we prove uniqueness a.e. [$\mu_1$]. Suppose that such an $f$ exists.
Let $g$ be another function such that $f$ and $g$ are not a.e. equal. Let
$A_n = \{x : f(x) > g(x) + 1/n\}$ and $B_n = \{x : f(x) < g(x) - 1/n\}$. Since $f$ and
$g$ are not equal a.e. [$\mu_1$], then there exists $n$ such that either $\mu_1(A_n) > 0$ or
$\mu_1(B_n) > 0$. Let $A$ be a subset of either $A_n$ or $B_n$ with finite positive measure.
Then $\int_A f(x)\,d\mu_1(x) \ne \int_A g(x)\,d\mu_1(x)$. Hence $g \ne d\mu_2/d\mu_1$.

The proof of existence proceeds as follows. First, we show that we can reduce
to the case in which $\mu_1$ is finite. Then, we create a collection of signed measures
$\nu_\alpha$ indexed by a real number $\alpha$. For each $\alpha$ we find a set $A^\alpha$ such
that every subset of $A^\alpha$ has positive $\nu_\alpha$ measure and every subset of the
complement $B^\alpha$ has negative $\nu_\alpha$ measure. We then show that
$B^\beta \subseteq B^\alpha$ for $\beta > \alpha$, which allows us to define
$f(x) = \sup\{\alpha : x \in B^\alpha\}$. Finally, we show that $f$ satisfies (A.75) and
(A.76).

Now, we prove that we need only consider finite $\mu_1$. Since $\mu_1$ is $\sigma$-finite, let
$\{A_n\}_{n=1}^{\infty}$ be disjoint elements of $\mathcal{A}$ such that $\mu_1(A_i) < \infty$
and $S = \bigcup_{i=1}^{\infty} A_i$. Let $\mu_{j,i}$ be $\mu_j$ restricted to $A_i$ for
$j = 1, 2$ and each $i$. Then $\mu_{2,i} \ll \mu_{1,i}$ for each $i$ and each $\mu_{1,i}$ is
finite. Suppose that for each $i$ we can find $f_i$ as in the theorem with $\mu_j$ replaced
by $\mu_{j,i}$ for $j = 1, 2$. Then $f(x) = \sum_{i=1}^{\infty} I_{A_i}(x) f_i(x)$ is the
function required by the theorem as stated. Hence, we prove the theorem only for the case
in which $\mu_1$ is finite.

Suppose that $\mu_1$ is finite, and define the signed measure
$\nu_\alpha = \alpha\mu_1 - \mu_2$ for each nonnegative rational number $\alpha$. (Note that
$\nu_\alpha(A)$ never equals $\infty$, although it may equal $-\infty$.) For each $\alpha$,
define

\[
\begin{aligned}
\mathcal{P}_\alpha &= \{A \in \mathcal{A} : \nu_\alpha(B) \ge 0, \text{ for every } B \subseteq A\}, \\
\lambda_\alpha &= \sup_{A \in \mathcal{P}_\alpha} \nu_\alpha(A).
\end{aligned}
\]

That is, $\lambda_\alpha$ is the supremum of the signed measures of sets all of whose subsets
have nonnegative signed measure.$^{32}$ Since $\emptyset \in \mathcal{P}_\alpha$,
$\lambda_\alpha \ge 0$. Let $\{A_i\}_{i=1}^{\infty}$ be such that
$\lambda_\alpha = \lim_{i \to \infty} \nu_\alpha(A_i)$, and let
$A^\alpha = \bigcup_{i=1}^{\infty} A_i$. Since every subset of $A^\alpha$ can be written as a
union of subsets of the $A_i$, it follows that $A^\alpha \in \mathcal{P}_\alpha$, hence
$\lambda_\alpha \ge \nu_\alpha(A^\alpha)$. Since $A^\alpha \setminus A_i \subseteq A^\alpha$, it
follows that $\nu_\alpha(A^\alpha \setminus A_i) \ge 0$ for all $i$ and
$\nu_\alpha(A^\alpha) = \nu_\alpha(A^\alpha \setminus A_i) + \nu_\alpha(A_i) \ge \nu_\alpha(A_i)$
for all $i$. It follows that $\lambda_\alpha \le \nu_\alpha(A^\alpha)$. Hence
$\lambda_\alpha = \nu_\alpha(A^\alpha) < \infty$. Define $B^\alpha = (A^\alpha)^c$.$^{33}$

Next, we prove that every subset of $B^\alpha$ has nonpositive measure. If not, let
$B \subseteq B^\alpha$ such that $\nu_\alpha(B) > 0$. If $B$ has no subsets with negative
signed measure,



32 The sets in $\mathcal{P}_\alpha$ are often called the positive sets relative to the
signed measure $\nu_\alpha$.

33 Such sets are called negative sets relative to the signed measure $\nu_\alpha$.




then $B \cup A^\alpha \in \mathcal{P}_\alpha$ and $\nu_\alpha(A^\alpha \cup B) > \lambda_\alpha$,
a contradiction. So, let $n_1$ be the smallest positive integer such that there is a
subset $B_1 \subseteq B$ with $\nu_\alpha(B_1) < -1/n_1$. For each $k > 1$, let $n_k$ be the
smallest positive integer such that there exists a subset
$B_k \subseteq B \setminus \bigcup_{i=1}^{k-1} B_i$ with $\nu_\alpha(B_k) < -1/n_k$. Now, let
$C = B \setminus \bigcup_{k=1}^{\infty} B_k$. Clearly $\nu_\alpha(C) > 0$. If we prove that $C$
has no subsets with negative signed measure, then $C \in \mathcal{P}_\alpha$ and we have
another contradiction. So, suppose that $D \subseteq C$ has $\nu_\alpha(D) = -\epsilon < 0$.
Since $\nu_\alpha(B) > 0$, it must be that $\sum_{k=1}^{\infty} \nu_\alpha(B_k) > -\infty$.
Hence $\lim_{k \to \infty} n_k = \infty$. So, there is $k$ such that
$1/(n_{k+1} - 1) < \epsilon$. Notice that
$D \subseteq C \subseteq B \setminus \bigcup_{i=1}^{k} B_i$. Since
$\nu_\alpha(D) < -1/(n_{k+1} - 1)$, this contradicts the definition of $n_{k+1}$.

If $\beta > \alpha$, we have

\[
\nu_\alpha(A^\alpha \cap B^\beta) \ge 0, \qquad \nu_\beta(A^\alpha \cap B^\beta) \le 0.
\]

Subtract the first inequality from the second to get
$(\beta - \alpha)\mu_1(A^\alpha \cap B^\beta) \le 0$, from which it follows that
$\mu_1(A^\alpha \cap B^\beta) = 0$. Since $\nu_\beta(A) \ge \nu_\alpha(A)$ for
$\beta > \alpha$, we can assume that $A^\alpha \subseteq A^\beta$ if $\beta > \alpha$. It
follows that $B^\beta \subseteq B^\alpha$ for $\beta > \alpha$, and we can define
$f(x) = \sup\{\alpha : x \in B^\alpha\}$. Since $B^0 = S$, $f(x) \ge 0$ for all $x$. It is
easy to see that $f(x) \ge \alpha$ if $x \in B^\alpha$ and $f(x) \le \alpha$ if
$x \in A^\alpha$. It is also easy to see that
$\{x : f(x) > b\} = \bigcup_{\alpha > b} B^\alpha$. Since this is a countable union of
measurable sets, it is measurable. By Lemma A.35, $f$ is measurable.

Next, we prove that (A.75) holds for every $A \in \mathcal{A}$. Let $A \in \mathcal{A}$ be
arbitrary and let $\epsilon > 0$ be given. Let $N > \mu_1(A)/\epsilon$ be a positive integer.
Define $E_k = A \cap B^{k/N} \cap A^{(k+1)/N}$ and
$E_\infty = A \setminus \bigcup_{k=1}^{\infty} A^{k/N}$. Then
$A = \bigcup_{k=0}^{\infty} E_k \cup E_\infty$ and the $E_j$ are all disjoint. So
$\mu_2(A) = \mu_2(E_\infty) + \sum_{k=0}^{\infty} \mu_2(E_k)$. By construction
$f(x) \in [k/N, (k+1)/N]$ for all $x \in E_k$ and $f(x) = \infty$ for all $x \in E_\infty$.
Since $\nu_{k/N}(E_k) \le 0$ and $\nu_{(k+1)/N}(E_k) \ge 0$, we have, for finite $k$,

\[
\left|\mu_2(E_k) - \int_{E_k} f(x)\,d\mu_1(x)\right| \le \frac{1}{N}\mu_1(E_k). \tag{A.77}
\]

If $\mu_1(E_\infty) > 0$, then $\mu_2(E_\infty) = \infty$ since $\nu_\alpha(E_\infty) \le 0$ for
all $\alpha$. If $\mu_1(E_\infty) = 0$, then $\mu_2(E_\infty) = 0$ by absolute continuity.
Either way, $\mu_2(E_\infty) = \int_{E_\infty} f(x)\,d\mu_1(x)$. Adding this into the sum of
(A.77) over all finite $k$ gives

\[
\left|\mu_2(A) - \int_A f(x)\,d\mu_1(x)\right| \le \frac{1}{N}\mu_1(A) < \epsilon.
\]

Since this is true for every $\epsilon > 0$, (A.75) is established.

To prove (A.76), we note that it is true if $g$ is an indicator function, hence it
is true for all simple functions. By the monotone convergence theorem A.52, it is
true for all nonnegative functions and by subtraction it is true for all integrable
functions.

Finally, if $f(x) = \infty$ for all $x \in A$ with $\mu_1(A) > 0$, then $\mu_2(B) = \infty$ for
every $B \subseteq A$ such that $\mu_1(B) > 0$. It is now impossible for $\mu_2$ to be
$\sigma$-finite. □
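
On a finite space the Radon-Nikodym derivative can be written down directly as a ratio of point masses, which gives a quick way to see (A.75) and (A.76) at work. The sketch below is only an illustration of the statement, not of the proof. The two measures, the test function `g`, and the helper `integral` are made up for this example.

```python
from fractions import Fraction

# Hypothetical point masses on S = {1, 2, 3, 4}; mu2 vanishes wherever mu1 does.
mu1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(0)}
mu2 = {1: Fraction(1), 2: Fraction(0), 3: Fraction(3, 4), 4: Fraction(0)}

# f = d(mu2)/d(mu1); on the mu1-null point the value is arbitrary (set to 0 here),
# which reflects uniqueness only a.e. [mu1].
f = {s: (mu2[s] / mu1[s] if mu1[s] > 0 else Fraction(0)) for s in mu1}

def integral(h, mu, A=None):
    """Integral of h over A (default: the whole space) with respect to mu."""
    A = mu.keys() if A is None else A
    return sum(h[s] * mu[s] for s in A)

g = {1: Fraction(-2), 2: Fraction(7), 3: Fraction(4), 4: Fraction(10)}
gf = {s: g[s] * f[s] for s in g}
A = {1, 3, 4}

print(sum(mu2[s] for s in A), integral(f, mu1, A))   # both sides of (A.75)
print(integral(g, mu2), integral(gf, mu1))           # both sides of (A.76)
```

Changing the value of `f` at the point 4, where `mu1` has no mass, leaves both identities intact.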

In statistical applications, we will often have a class of measures, each of which
is absolutely continuous with respect to a single $\sigma$-finite measure. It would be
nice if the single dominating measure were in the original class or could be
constructed from the class. The following theorem addresses this problem. The proof
is borrowed from Lehmann (1986).






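Theorem A.78.$^{34}$ Let $\mu$ be a $\sigma$-finite measure on $(S, \mathcal{A})$. Suppose that $\mathcal{N}$ is a
collection of measures on $(S, \mathcal{A})$ such that for every $\nu \in \mathcal{N}$, $\nu \ll \mu$. Then there
exists a sequence of nonnegative numbers $\{c_i\}_{i=1}^\infty$ and a sequence of elements of
$\mathcal{N}$, $\{\nu_i\}_{i=1}^\infty$, such that $\sum_{i=1}^\infty c_i = 1$ and $\nu \ll \sum_{i=1}^\infty c_i\nu_i$ for every $\nu \in \mathcal{N}$.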

Proof. If $\mathcal{N}$ is a countable collection, the result is trivially true. If $\mu$ is finite,
let $\lambda = \mu$. If $\mu$ is not finite, then there exists a countable partition of $S$ into
$\{S_i\}_{i=1}^\infty$ such that $0 < \mu(S_i) = d_i < \infty$. For each $B \in \mathcal{A}$, let $\lambda(B) = \sum_{i=1}^\infty \mu(B \cap
S_i)/(2^i d_i)$. In either case $\lambda$ is finite and $\nu \ll \lambda$ for every $\nu \in \mathcal{N}$. Define $\mathcal{Q}$ to be
the collection of all measures of the form $\sum_{i=1}^\infty a_i\nu_i$ where $\sum_{i=1}^\infty a_i = 1$ and each
$\nu_i \in \mathcal{N}$. Clearly $Q \in \mathcal{Q}$ implies $Q \ll \lambda$.

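Next, let $\mathcal{D}$ be the collection of sets $C$ in $\mathcal{A}$ such that there exists $Q \in \mathcal{Q}$
satisfying $\lambda(\{x \in C : dQ/d\lambda(x) = 0\}) = 0$ and $Q(C) > 0$. To see that $\mathcal{D}$ is
nonempty, let $\nu$ be a measure in $\mathcal{N}$ that is not identically 0 and let $C = \{x :
d\nu/d\lambda(x) > 0\}$. Then with $Q = \nu$, we have $\{x \in C : dQ/d\lambda(x) = 0\} = \emptyset$ and
$Q(C) = \nu(C) = \nu(S) > 0$, so $C \in \mathcal{D}$. Since $\lambda$ is finite, $\sup_{C \in \mathcal{D}} \lambda(C) = c < \infty$, so
there exist $\{C_n\}_{n=1}^\infty$ such that $\lim_{n\to\infty} \lambda(C_n) = c$ and $C_n \in \mathcal{D}$ for all $n$. Let $C_0 =
\cup_{n=1}^\infty C_n$ and let $Q_n \in \mathcal{Q}$ be such that $Q_n(C_n) > 0$ and $\lambda(\{x \in C_n : dQ_n/d\lambda(x) =
0\}) = 0$. Let $Q_0 = \sum_{n=1}^\infty 2^{-n}Q_n \in \mathcal{Q}$, so that $dQ_0/d\lambda = \sum_{n=1}^\infty 2^{-n}\,dQ_n/d\lambda$ and

$$\left\{x \in C_0 : \frac{dQ_0}{d\lambda}(x) = 0\right\} \subseteq \bigcup_{n=1}^\infty \left\{x \in C_n : \frac{dQ_n}{d\lambda}(x) = 0\right\},$$

which implies that $C_0 \in \mathcal{D}$ and $\lambda(C_0) = c$.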

Since $Q_0 \in \mathcal{Q}$, we now need only prove that $\nu \ll Q_0$ for all $\nu \in \mathcal{N}$ to finish
the proof. Suppose that $Q_0(A) = 0$ and $\nu \in \mathcal{N}$. We must prove $\nu(A) = 0$. Since
$Q_0(A \cap C_0) = 0$ and $dQ_0/d\lambda(x) > 0$ for all $x \in C_0$, it follows that $\lambda(A \cap C_0) = 0$
and hence $\nu(A \cap C_0) = 0$. Let $C = \{x : d\nu/d\lambda(x) > 0\}$. Then, $\nu(A \cap C_0^c \cap C^c) = 0$
since $d\nu/d\lambda(x) = 0$ for $x \in C^c$. Let $D = A \cap C_0^c \cap C$, which is disjoint from $C_0$. If
$\lambda(D) > 0$, then $\lambda(C_0 \cup D) > \lambda(C_0)$ and $D \in \mathcal{D}$. It follows easily that $C_0 \cup D \in \mathcal{D}$
and $\lambda(C_0 \cup D) > \lambda(C_0)$ contradicts $\lambda(C_0) = c$. Hence $\lambda(D) = 0$ and $\nu(D) = 0$,
which implies $\nu(A) = \nu(A \cap C_0) + \nu(A \cap C_0^c \cap C^c) + \nu(D) = 0$. □

There is a chain rule for Radon-Nikodym derivatives. 

Theorem A.79 (Chain rule).$^{35}$ Let $\nu$ and $\eta$ be $\sigma$-finite measures and suppose
that $\mu \ll \nu \ll \eta$. Then

$$\frac{d\mu}{d\eta}(s) = \frac{d\mu}{d\nu}(s)\,\frac{d\nu}{d\eta}(s), \quad \text{a.e. } [\eta]. \tag{A.80}$$

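Proof. It is easy to see that $\mu \ll \eta$ so that $d\mu/d\eta$ exists. For every set $A$, it
follows from (A.76) that

$$\mu(A) = \int_A \frac{d\mu}{d\nu}(s)\,d\nu(s) = \int_A \frac{d\mu}{d\nu}(s)\,\frac{d\nu}{d\eta}(s)\,d\eta(s).$$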



34 This theorem is used in the proofs of Lemmas 2.15 and 2.24. It appears as
Theorem 2 in Appendix 3 of Lehmann (1986) and is attributed to Halmos and
Savage (1949).

35 This theorem is used in the proof of Lemma 2.15. 






By the uniqueness of Radon-Nikodym derivatives, (A.80) holds. □

The Radon-Nikodym theorem A.74 relates integrals with respect to two different
measures on the same space. There are also theorems that relate integrals
with respect to two different measures on two different spaces.

Theorem A.81. A measurable function $f$ from one measure space $(S_1, \mathcal{A}_1, \mu_1)$
to a measurable space $(S_2, \mathcal{A}_2)$, $f : S_1 \to S_2$, induces a measure on the range $S_2$.
For each $A \in \mathcal{A}_2$, define $\mu_2(A) = \mu_1(f^{-1}(A))$. Integrals with respect to $\mu_2$ can
be written as integrals with respect to $\mu_1$ in the following way: If $g : S_2 \to \mathbb{R}$ is
integrable, then

$$\int g(y)\,d\mu_2(y) = \int g(f(x))\,d\mu_1(x). \tag{A.82}$$

Proof. What needs to be proven is that $\mu_2$ is indeed a measure and that (A.82)
holds. To see that $\mu_2$ is a measure, note that if $A, B \in \mathcal{A}_2$ are disjoint, then so too
are $f^{-1}(A)$ and $f^{-1}(B)$. The fact that $\mu_2$ is nonnegative and countably additive
now follows directly from the same fact about $\mu_1$.

If $g : S_2 \to \mathbb{R}$ is the indicator function of a set $A$, then

$$\int g(y)\,d\mu_2(y) = \mu_2(A) = \mu_1(f^{-1}(A)) = \int I_{f^{-1}(A)}(x)\,d\mu_1(x) = \int g(f(x))\,d\mu_1(x).$$

That (A.82) is true for all nonnegative simple functions follows by adding the far
ends of this equation (multiplied by positive constants). The monotone convergence
theorem A.52 allows us to extend the equality to all nonnegative integrable
functions. By subtraction, we can extend to all integrable functions. □

Definition A.83. The measure $\mu_2$ in Theorem A.81 is called the measure induced
on $(S_2, \mathcal{A}_2)$ by $f$ from $\mu_1$.
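
On a finite space, the induced measure and the change-of-variables identity (A.82) can be checked directly. The following Python sketch is only an illustration under stated assumptions: the helper names `pushforward` and `integrate`, the toy measure `mu1`, and the functions `f` and `g` are all hypothetical choices, not part of the text.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(mu1, f):
    """Induced measure mu2(A) = mu1(f^{-1}(A)), stored as point masses on the range of f."""
    mu2 = defaultdict(Fraction)
    for x, mass in mu1.items():
        mu2[f(x)] += mass          # all points mapped to f(x) pool their mass
    return dict(mu2)

def integrate(g, mu):
    """Integral of g with respect to a measure given by finitely many point masses."""
    return sum(g(x) * mass for x, mass in mu.items())

# Toy (non-probability) measure on S1 = {-2, -1, 0, 1, 2}; f is not one-to-one.
mu1 = {-2: Fraction(1, 2), -1: Fraction(1), 0: Fraction(3),
       1: Fraction(2), 2: Fraction(1, 4)}
f = lambda x: x * x
g = lambda y: 3 * y + 1

mu2 = pushforward(mu1, f)
lhs = integrate(g, mu2)                   # integral of g(y) d(mu2)(y)
rhs = integrate(lambda x: g(f(x)), mu1)   # integral of g(f(x)) d(mu1)(x)
```

Exact rational arithmetic is used so that the two sides of (A.82) can be compared with `==` rather than with a floating-point tolerance.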

If the measure $\mu_1$ in Theorem A.81 is not finite, and the function $f$ is not
one-to-one, the measure $\mu_2$ may not be very interesting.

Example A.84. Let $S_1 = \mathbb{R}^2$, $S_2 = \mathbb{R}$, let $\mu_1$ equal Lebesgue measure on $\mathbb{R}^2$, and
let $f(x, y) = x$. Let the two $\sigma$-fields be Borel $\sigma$-fields. The measure $\mu_2$ that $f$ induces
on $\mathbb{R}$ from $\mu_1$ is the following. If $A \in \mathcal{A}_2$ and the Lebesgue measure of $A$ is
0, then $\mu_2(A) = 0$. Otherwise, $\mu_2(A) = \infty$. Although $\mu_2$ is absolutely continuous
with respect to Lebesgue measure, it is not $\sigma$-finite. The only functions $g$ that
are integrable with respect to $\mu_2$ are those that are almost everywhere 0.

If $\mu_1$ is $\sigma$-finite, there is a way to avoid the problem in Example A.84 by making
use of the following result.

Theorem A.85.$^{36}$ A measure $\mu$ on a space $(S, \mathcal{A})$ is $\sigma$-finite if and only if there
exists an integrable function $f : S \to \mathbb{R}$ such that $f > 0$, a.e. $[\mu]$.

36 This theorem is used in the proof of Theorem B.46. 






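Proof. For the "if" part, let $f$ be as in the statement of the theorem. Let $0 <
\int f(s)\,d\mu(s) = c < \infty$. Let $A_n = \{s : 1/n \leq f(s) < 1/(n-1)\}$, for $n = 1, 2, \ldots$.
We see that $A_1 = \{s : f(s) \geq 1\}$ and $S = \cup_{n=1}^\infty A_n$. We can write

$$c = \sum_{n=1}^\infty \int_{A_n} f(s)\,d\mu(s) \geq \sum_{n=1}^\infty \frac{1}{n}\int_{A_n} d\mu(s) = \sum_{n=1}^\infty \frac{1}{n}\mu(A_n).$$

It follows that $\mu(A_n) \leq nc$ for all $n$. Hence $\mu$ is $\sigma$-finite.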

For the "only if" part, assume that $\mu$ is $\sigma$-finite, and let $\{A_n\}_{n=1}^\infty$ be mutually
disjoint sets such that $S = \cup_{n=1}^\infty A_n$ and $\mu(A_n) < \infty$ for all $n$. Define $f(s)$ to
equal $2^{-n}/\mu(A_n)$ for all $s \in A_n$ and for all $n$ such that $\mu(A_n) > 0$. For $n$ such
that $\mu(A_n) = 0$, set $f(s) = 0$ if $s \in A_n$. Then

$$\int f(s)\,d\mu(s) = \sum_{n : \mu(A_n) > 0} \frac{2^{-n}}{\mu(A_n)}\,\mu(A_n) \leq 1. \qquad □$$



Example A.86 (Continuation of Example A.84; see page 601). Let $h(x, y) =
\exp(-[x^2 + y^2]/2)$. It is known that $h$ is integrable with respect to $\mu_1$ and $h$
is everywhere strictly positive. Let $\mu_1'(C) = \int_C h(x, y)\,d\mu_1(x, y)$. Then $\mu_1' \ll \mu_1$
and $\mu_1 \ll \mu_1'$. The measure $\mu_2'$ induced on $(S_2, \mathcal{A}_2)$ from $\mu_1'$ by $f(x, y) = x$
is $\mu_2'(B) = \sqrt{2\pi}\int_B \exp(-x^2/2)\,dx$. A function $g : S_2 \to \mathbb{R}$ is integrable with
respect to $\mu_2'$ if and only if $\exp(-x^2/2)g(x)$ is integrable with respect to Lebesgue
measure.

As a sort of reverse version of Theorem A.81, functions from a measurable
space to a measure space induce measures on the domain space.

Proposition A.87. Let $f$ be a measurable function from a measurable space
$(S_1, \mathcal{A}_1)$ to a measure space $(S_2, \mathcal{A}_2, \mu_2)$, $f : S_1 \to S_2$. Let $\mathcal{A}_{1f} \subseteq \mathcal{A}_1$ be the
$\sigma$-field generated by $f$, and let $T$ be the image of $f$. Suppose that $T \in \mathcal{A}_2$. Then $f$
induces a measure $\mu_1$ on $(S_1, \mathcal{A}_{1f})$ defined by $\mu_1(A) = \mu_2(T \cap B)$ if $A = f^{-1}(B)$.
Furthermore, if $g : (S_1, \mathcal{A}_{1f}) \to \mathbb{R}$ is integrable with respect to $\mu_1$, then

$$\int g(x)\,d\mu_1(x) = \int h(y)\,d\mu_2(y), \tag{A.88}$$

where $h$ satisfying $h(f(x)) = g(x)$ is guaranteed to exist by Theorem A.42.



A.7 Problems

Section A.2:

1. Let $S$ be a set and let $\mathcal{A}$ be the collection of all subsets of $S$ that either
are countable or have countable complement. Prove that $\mathcal{A}$ is a $\sigma$-field.

2. Prove Proposition A.10 on page 575.






3. Prove Proposition A.13 on page 576. (Hint: First, show that every open
ball in $\mathbb{R}^k$ is the union of countably many open rectangles. Then prove
that the smallest $\sigma$-field containing open balls must be the same as the
smallest $\sigma$-field containing open rectangles.)

4. Prove that the collection of sets defined on page 571 is a $\sigma$-field of subsets
of the extended real numbers.

5. Prove Proposition A.15 on page 576.

6. Prove Proposition A.16 on page 576.

7. *Let $F : \mathbb{R} \to \mathbb{R}$ be a nondecreasing function that is continuous from the
right. For each interval $(a, b]$, define $\mu((a, b]) = F(b) - F(a)$.

(a) Suppose that $\{(a_n, b_n]\}_{n=1}^\infty$ is a sequence of disjoint intervals such
that $\cup_{n=1}^\infty (a_n, b_n] \subseteq (a, b]$. Prove that $\sum_{n=1}^\infty \mu((a_n, b_n]) \leq \mu((a, b])$.
(Hint: Prove it for finite collections and take a limit.)

(b) Suppose that $\{(a_n, b_n]\}_{n=1}^\infty$ is a sequence of disjoint intervals such
that $(a, b] \subseteq \cup_{n=1}^\infty (a_n, b_n]$. Prove that $\sum_{n=1}^\infty \mu((a_n, b_n]) \geq \mu((a, b])$.
(Hint: First, prove it for finite collections by induction. For infinite
collections, let $\mu((a, b]) > \epsilon > 0$. Cover a compact interval $[a + \delta, b]$
with finitely many open intervals $(a_n, b_n + \delta_n)$ such that $|\mu((a, b]) -
\mu((a + \delta, b])| < \epsilon/2$ and $|\sum_{n=1}^\infty \mu((a_n, b_n]) - \sum_{n=1}^\infty \mu((a_n, b_n + \delta_n])| <
\epsilon/2$. This can be done by using continuity from the right.)

(c) Prove that $\mu$ is countably additive on the smallest field containing
intervals of the form $(a, b]$. (Hint: Deal separately with finite and
semi-infinite intervals.)

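8. A measure space $(S, \mathcal{A}, \mu)$ is complete if $A \subseteq B \in \mathcal{A}$ and $\mu(B) = 0$ implies
$A \in \mathcal{A}$. Let $(S, \mathcal{C}, \mu)$ be a measure space, and let $\mathcal{D} = \{D : \exists A, C \in
\mathcal{C}$ with $D \Delta A \subseteq C$ and $\mu(C) = 0\}$. For each $D \in \mathcal{D}$, define $\mu^*(D) = \mu(A)$,
where $A, C \in \mathcal{C}$ satisfy $D \Delta A \subseteq C$ and $\mu(C) = 0$. Show that $\mu^*$ is well defined and that
$(S, \mathcal{D}, \mu^*)$ is a complete measure space.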

Section A.3:

9. Prove Proposition A.28 on page 583. 

10. Prove Proposition A.32 on page 584. 

11. Prove Proposition A.36 on page 584. 

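12. Let $(S, \mathcal{A}, \mu)$ be a measure space, and let $\{f_n\}_{n=1}^\infty$ be a sequence of
measurable functions from $S$ to $\mathbb{R}$. Suppose that for every $\epsilon > 0$, $\sum_{n=1}^\infty \mu(\{s :
|f_n(s)| > \epsilon\}) < \infty$. Prove that $\lim_{n\to\infty} f_n = 0$, a.e. $[\mu]$. (Hint: Use the
first Borel-Cantelli lemma A.20.)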

13. Let $(S_j, \mathcal{A}_j)$ for $j = 0, 1, 2, 3$ be measurable spaces. Let $f_j : S_0 \to S_j$ be
measurable and onto for $j = 1, 2, 3$. Let $\mathcal{A}_{0j}$ be the $\sigma$-field generated by
$f_j$ for $j = 1, 2$. Prove that $f_3$ is measurable with respect to $\mathcal{A}_{01} \cap \mathcal{A}_{02}$
if and only if there exist measurable $g_j : S_j \to S_3$ for $j = 1, 2$ such that
$f_3 = g_1(f_1) = g_2(f_2)$.






Section A.4:

14. If $f \geq 0$ is measurable and $\int f(s)\,d\mu(s) = 0$, then show that $f(s) = 0$, a.e. $[\mu]$.

15. If $f(s) > 0$ for all $s \in A$ and $\mu(A) > 0$, prove that $\int_A f(s)\,d\mu(s) > 0$.

16. Prove Proposition A.45 on page 588. (Hint: Use induction on n.) 

17. Prove Proposition A.49 on page 588. (Hint: For part 4, use Problem 14 on 
page 604.) 

18. Let $S = \mathbb{R}$, and let $\mathcal{A}$ be the $\sigma$-field of sets that are either countable or have
countable complement. (See Problem 1 on page 602.) Let $\mu$ be Lebesgue
measure. Suppose that $f : S \to \mathbb{R}$ is integrable. Prove that $f = 0$, a.e. $[\mu]$.

19. Let $(S, \mathcal{A})$ be a measurable space, and let $f$ be a bounded measurable
function. (That is, there exist $a$ and $b$ such that $a \leq f(x) \leq b$ for all
$x \in S$.)

(a) Let $\mu$ be a measure on $(S, \mathcal{A})$ such that $\mu(S) = 1$. Prove that

$$a \leq \int f(x)\,d\mu(x) \leq b.$$

(b) Let $\epsilon > 0$. Prove that there exists a simple function $g$ such that for
all measures $\mu$ satisfying $\mu(S) = 1$, $|\int f(x)\,d\mu(x) - \int g(x)\,d\mu(x)| < \epsilon$.

20. Prove the following alternative type of monotone convergence theorem:
Let $\{f_n\}_{n=1}^\infty$ be a sequence of integrable functions such that $f_n(x)$
converges monotonically to $f(x)$ a.e. $[\mu]$. Then $\int f(x)\,d\mu(x)$ is defined and
$\int f(x)\,d\mu(x) = \lim_{n\to\infty} \int f_n(x)\,d\mu(x)$. (Hint: Use the dominated convergence
theorem A.57 on the positive parts of $f_n$ and the monotone convergence
theorem A.52 on the negative parts, or vice versa, depending on
whether the convergence is from above or below.)

21. Let $(S, \mathcal{A}, \mu)$ be a measure space, let $\{g_n\}_{n=1}^\infty$ be a sequence of integrable
functions that converges a.e. $[\mu]$, and let $g$ be another integrable function.
Suppose that for all $C \in \mathcal{A}$,

$$\lim_{n\to\infty} \int_C g_n(s)\,d\mu(s) = \int_C g(s)\,d\mu(s).$$

Prove that $\lim_{n\to\infty} g_n = g$, a.e. $[\mu]$.

Section A.5:



22. Prove Proposition A.66 on page 595. 

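23. Let $(S_1, \mathcal{A}_1)$ and $(S_2, \mathcal{A}_2)$ be measurable spaces, and define the product
space $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. Prove that $A \times B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ with $A \subseteq S_1$ and
$B \subseteq S_2$ implies $A \in \mathcal{A}_1$ and $B \in \mathcal{A}_2$. (Hint: For each $C \in \mathcal{A}_1 \otimes \mathcal{A}_2$,
define $C_y = \{x : (x, y) \in C\}$. Then let $\mathcal{C} = \{C : C_y \in \mathcal{A}_1$ for all $y \in S_2\}$.
Prove that $\mathcal{C}$ is a $\sigma$-field containing all product sets.)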






Section A.6:

24. Suppose that $\mu_1 \ll \mu_2$ and $\mu_2 \ll \mu_1$.

(a) Show that a.e. $[\mu_1]$ means the same thing as a.e. $[\mu_2]$.

(b) Show that

$$\frac{d\mu_1}{d\mu_2}(s) = \left(\frac{d\mu_2}{d\mu_1}(s)\right)^{-1}, \quad \text{a.e. } [\mu_1] \text{ and a.e. } [\mu_2].$$

25. If $\mu_1$ is a measure and $f$ is a nonnegative measurable function, then define
the measure $\mu_2$ by $\mu_2(A) = \int_A f(s)\,d\mu_1(s)$. Prove that $\mu_2 \ll \mu_1$.

26. Let $\lambda$ be Lebesgue measure on $\mathbb{R}$ and define

$$\mu(A) = \lambda(A) + cI_A(x_0),$$

for some fixed $c > 0$ and $x_0 \in \mathbb{R}$.

(a) Prove that $\mu$ is a measure.

(b) Show that $\lambda \ll \mu$, but that $\mu \not\ll \lambda$.

(c) Show that $\int f(x)\,d\mu(x) = \int f(x)\,d\lambda(x) + cf(x_0)$.

27. *In the proof of Theorem A.74, we proved the Hahn decomposition theorem
for signed measures, namely that if $\nu$ is a signed measure on $(S, \mathcal{A})$, then
there exists $A \in \mathcal{A}$ such that $A$ is a positive set and $A^c$ is a negative set
relative to $\nu$.

(a) Let $\nu$ be a signed measure on $(S, \mathcal{A})$. Suppose that there are two
different Hahn decompositions. That is, $A_1$ and $A_2$ are both positive sets
and $A_1^c$ and $A_2^c$ are both negative sets. Prove that every measurable
subset $B$ of $A_1 \cap A_2^c$ has $\nu(B) = 0$.

(b) If $\nu$ is a signed measure on $(S, \mathcal{A})$, use the Hahn decomposition
theorem to create definitions for the following:

i. The integral with respect to $\nu$ of a measurable function.

ii. When a function is integrable with respect to $\nu$.

(c) If there are two different Hahn decompositions for a signed measure
$\nu$, prove that the definition of integral with respect to $\nu$ produces the
same value for both decompositions.

28. In the statement of Proposition A.87 on page 602, prove that the measure
$\mu_1$ is well defined. (That is, suppose that $A = f^{-1}(B_1) = f^{-1}(B_2)$, and
prove that $\mu_2(B_1 \cap T) = \mu_2(B_2 \cap T)$.) Also prove that $\mu_1$ is a measure.

29. In the statement of Proposition A.87 on page 602, assuming that $\mu_1$ is a
well-defined measure, prove that (A.88) holds.



Appendix B 
Probability Theory 



This appendix builds on Appendix A but is otherwise self-contained. It contains 
an introduction to the theory of probability. The first section is an overview. 
It could serve either as a refresher for those who have previously studied the 
material or as an informal introduction for those who have never studied it. 



B.1 Overview

B.1.1 Mathematical Probability

The measure theoretic definition of probability is that a measure space $(S, \mathcal{A}, \mu)$
is called a probability space and $\mu$ is called a probability if $\mu(S) = 1$. Each element
of $\mathcal{A}$ is called an event. A measurable function $X$ from $S$ to some other space
$(\mathcal{X}, \mathcal{B})$ is called a random quantity. The most popular type of random quantity
is a random variable, which occurs when $\mathcal{X}$ is $\mathbb{R}$ with the Borel $\sigma$-field. The
probability measure $\mu_X$ induced on $(\mathcal{X}, \mathcal{B})$ by $X$ from $\mu$ is called the distribution
of $X$.

Example B.1. Let $S = \mathcal{X} = \mathbb{R}$ with Borel $\sigma$-field. Let $f$ be a nonnegative
function such that $\int f(x)\,dx = 1$. Define $\mu(A) = \int_A f(x)\,dx$ and $X(s) = s$. Then
$X$ is a continuous random variable with density $f$, and $\mu_X = \mu$. If we let $\nu$ denote
Lebesgue measure, then $\mu_X \ll \nu$ with $d\mu_X/d\nu = f$.

Example B.2. Let $S = \mathbb{R}$ with Borel $\sigma$-field. Let $\mathcal{X} = \{x_1, x_2, \ldots\}$ be a countable
set. Let $f$ be a nonnegative function defined on $\mathcal{X}$ such that $\sum_{i=1}^\infty f(x_i) = 1$.
Define $\mu(A) = \sum_{x_i \in A} f(x_i)$. Then $X$ is a discrete random variable with
probability mass function $f$, and $\mu_X = \mu$. If we let $\nu$ denote counting measure on $\mathcal{X}$,
then $\mu_X \ll \nu$ with $d\mu_X/d\nu = f$.






In both of these examples, we will say that $f$ is the density of $X$ with respect
to $\nu$.

When there is one probability space $(S, \mathcal{A}, \mu)$ from which all other probabilities
are induced by way of random quantities, then the probability in that one space
will be denoted Pr. So, for example, if $\mu_X$ is the distribution of a random quantity
$X$ and if $B \in \mathcal{B}$, then $\Pr(X \in B) = \mu(X^{-1}(B)) = \mu_X(B)$.

The expected value or mean or expectation of a random variable $X$ is defined
(and denoted) as $E(X) = \int x\,d\mu_X(x)$, if the integral exists, where $\mu_X$ is the
distribution of $X$. If $X$ is a vector of random variables (called a random vector),
then $E(X)$ will stand for the vector with coordinates equal to the means of the
coordinates of $X$.

The (in)famous law of the unconscious statistician, B.12, is very useful for
calculating means of functions of random quantities. It says that $E[f(X)] =
\int f(x)\,d\mu_X(x)$. For example, the variance of a random variable $X$ with mean $c$ is
$\mathrm{Var}(X) = E([X - c]^2)$, which can be calculated as $\int (x - c)^2\,d\mu_X(x)$. The
covariance between two random variables $X$ and $Y$ with means $c_X$ and $c_Y$, respectively,
is $\mathrm{Cov}(X, Y) = E([X - c_X][Y - c_Y])$.
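
For a discrete distribution, the law of the unconscious statistician can be checked by computing the mean of $f(X)$ in two ways: from the induced distribution of $Y = f(X)$, and directly from the distribution of $X$. The sketch below is illustrative only; the distribution `mu_X`, the function `f`, and the helper names `expect` and `induced` are assumptions made for the example.

```python
from fractions import Fraction

# Distribution mu_X of a discrete random variable X: value -> probability.
mu_X = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4)}

def expect(h, dist):
    """Integral of h with respect to a discrete distribution."""
    return sum(h(x) * p for x, p in dist.items())

def induced(f, dist):
    """Distribution of f(X) induced from the distribution of X."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

f = lambda x: x * x
mu_Y = induced(f, mu_X)                          # distribution of Y = f(X)
mean_by_definition = expect(lambda y: y, mu_Y)   # E(Y) as the integral of y d(mu_Y)
mean_by_lotus = expect(f, mu_X)                  # integral of f(x) d(mu_X)

# Variance of X as the mean of (X - c)^2, again computed from mu_X alone.
c = expect(lambda x: x, mu_X)
variance = expect(lambda x: (x - c) ** 2, mu_X)
```

The two means agree without ever tabulating the distribution of $Y$ in the second computation, which is the practical content of the result.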



B.1.2 Conditioning

We begin with a heuristic derivation of the important concepts using the special
case of discrete random quantities. Afterwards, we define the important terms in
a more rigorous way.

Consider the case of two random quantities $X$ and $Y$, each of which assumes at
most countably many distinct values, $X \in \mathcal{X} = \{x_1, x_2, \ldots\}$ and $Y \in \mathcal{Y} = \{y_1, y_2, \ldots\}$.
Let $p_{ij} = \Pr(X = x_i, Y = y_j)$. Then

$$\Pr(X = x_i) = \sum_{j=1}^\infty p_{ij} = p_{i\cdot}, \quad \text{and}$$
$$\Pr(Y = y_j) = \sum_{i=1}^\infty p_{ij} = p_{\cdot j}.$$

These equations give the marginal distributions of $X$ and $Y$, respectively. We can
define the conditional probability that $X = x_i$ given $Y = y_j$ by

$$\Pr(X = x_i | Y = y_j) = \frac{p_{ij}}{p_{\cdot j}} = p_{i|j}.$$

Note that for each $j$, $\sum_{i=1}^\infty p_{i|j} = 1$ so that the numbers $\{p_{i|j}\}_{i=1}^\infty$ define a
probability distribution on $\mathcal{X}$ known as the conditional distribution of $X$ given $Y = y_j$.
We can calculate the conditional mean (expectation) of a function $f$ of $X$ given
$Y = y_j$ by

$$E(f(X) | Y = y_j) = \sum_{i=1}^\infty f(x_i)\,p_{i|j}.$$

From the conditional distribution, we could define a measure on $(\mathcal{X}, 2^{\mathcal{X}})$ by

$$\mu_{X|Y}(A | y_j) = \sum_{x_i \in A} p_{i|j}.$$






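It follows that, for each $j$, $E(f(X) | Y = y_j) = \int f(x)\,d\mu_{X|Y}(x | y_j)$. We can think
of this conditional mean as a function of $y$:

$$g(y) = E(f(X) | Y = y).$$

The marginal distribution of $Y$ is a measure on $(\mathcal{Y}, 2^{\mathcal{Y}})$ defined by

$$\mu_Y(B) = \sum_{y_j \in B} p_{\cdot j}, \quad \text{for all } B \in 2^{\mathcal{Y}}.$$

Similarly, the joint distribution of $(X, Y)$ induces a measure on $(\mathcal{X} \times \mathcal{Y}, 2^{\mathcal{X}} \otimes 2^{\mathcal{Y}})$
by $\mu_{X,Y}(C) = \sum_{(x_i, y_j) \in C} p_{ij}$, for all $C \in 2^{\mathcal{X}} \otimes 2^{\mathcal{Y}}$. The point of all of these
measures and distributions is the following. We can write the integral of $g$ over
any set $B \in 2^{\mathcal{Y}}$ as

$$\int_B g(y)\,d\mu_Y(y) = \sum_{y_j \in B} g(y_j)\,p_{\cdot j} = \sum_{y_j \in B} \sum_{i=1}^\infty f(x_i)\,p_{i|j}\,p_{\cdot j} = \sum_{y_j \in B} \sum_{i=1}^\infty f(x_i)\,p_{ij}$$
$$= \int f(x) I_B(y)\,d\mu_{X,Y}(x, y) = E(f(X) I_B(Y)).$$

The overall equation

$$\int_B g(y)\,d\mu_Y(y) = E(f(X) I_B(Y))$$

will be used as the property that defines conditional expectation in general.
Through the definition of conditional expectation, we will define conditional
probability and conditional distributions in general.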
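
The defining property can be verified numerically in the discrete case. In the sketch below, the joint table `p`, the value lists `xs` and `ys`, and the function names are hypothetical choices made for illustration; index `j` stands for the value `ys[j]`, and `B` is a set of such indices.

```python
from fractions import Fraction

xs = [0, 1, 2]
ys = ["a", "b"]
# Joint probabilities p[(i, j)] = Pr(X = xs[i], Y = ys[j]).
p = {(0, 0): Fraction(1, 8), (1, 0): Fraction(1, 8), (2, 0): Fraction(1, 4),
     (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 8), (2, 1): Fraction(1, 8)}

f = lambda x: x * x

def marginal_y(j):
    """Marginal probability Pr(Y = ys[j]), the sum of p over i."""
    return sum(p[(i, j)] for i in range(len(xs)))

def g(j):
    """Conditional mean E(f(X) | Y = ys[j]), the sum over i of f(x_i) p_{i|j}."""
    return sum(f(xs[i]) * p[(i, j)] / marginal_y(j) for i in range(len(xs)))

def lhs(B):
    """Integral of g over B with respect to the marginal distribution of Y."""
    return sum(g(j) * marginal_y(j) for j in B)

def rhs(B):
    """E(f(X) I_B(Y)) computed from the joint distribution."""
    return sum(f(xs[i]) * p[(i, j)] for (i, j) in p if j in B)
```

Taking `B` to be the full index set `{0, 1}` reduces the identity to $E(f(X)) = E[g(Y)]$.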

Theorem B.21 says that, in general, if a random variable $X$ has finite mean and
if $\mathcal{C}$ is a sub-$\sigma$-field of $\mathcal{A}$, then a function $g : S \to \mathbb{R}$ exists which is measurable
with respect to the $\sigma$-field $\mathcal{C}$ and such that

$$E(X I_B) = \int_B g(s)\,d\mu(s), \quad \text{for all } B \in \mathcal{C}. \tag{B.3}$$

This is the general version of what we worked out above for discrete random
variables in which $\mathcal{C}$ was the $\sigma$-field generated by $Y$. We will use the symbol
$E(X|\mathcal{C})$ to stand for the function $g$. The two important features that $E(X|\mathcal{C})$
possesses are that it is measurable with respect to the $\sigma$-field $\mathcal{C}$ and that it satisfies
(B.3). Any function that equals $E(X|\mathcal{C})$ a.s. $[\mu]$ will also satisfy (B.3), so there
may be many functions that satisfy the definition of conditional expectation. All
such functions are called versions of the conditional expectation. When we say
that a random variable equals $E(X|\mathcal{C})$, we will mean that it is a version of $E(X|\mathcal{C})$.

Notice that we can set $B = S$ in (B.3) and the equation becomes $E(X) =
E[E(X|\mathcal{C})]$. This result is called the law of total probability. A useful generalization
is given in Theorem B.70.

If $\mathcal{C}$ is the $\sigma$-field generated by another random quantity $Y$, then the symbol
$E(X|Y)$ is usually used instead of $E(X|\mathcal{C})$. For the case in which $\mathcal{C}$ is the $\sigma$-field
generated by $Y$, some special notation is introduced. We saw in Theorem A.42
that a function is measurable with respect to the $\sigma$-field generated by $Y$ if and
only if it is a function of $Y$. Hence, there is a function $h$ defined on the space
$\mathcal{Y}$ where $Y$ takes its values such that $E(X|Y) = h(Y)$. We use the notation
$E(X|Y = t)$ to stand for $h(t)$. (See Corollary B.22.) In this notation, we have, for
all $B \in \mathcal{C}$ of the form $B = Y^{-1}(C)$, $E(X I_B) = \int_C E(X|Y = t)\,d\mu_Y(t)$, where $\mu_Y$ is the
distribution of $Y$.

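Example B.4. Let $S = \mathbb{R}^2$ and let $\mathcal{A}$ be the two-dimensional Borel sets. Let

$$\mu(A) = \int_A \frac{1}{\sqrt{3}\,\pi} \exp\left\{-\frac{2}{3}\left[s_1^2 - s_1 s_2 + s_2^2\right]\right\} ds_1\,ds_2.$$

Suppose that $X(s) = s_1$ and $Y(s) = s_2$ when $s = (s_1, s_2)$. Now $E(|X|) = \sqrt{2/\pi} <
\infty$. We claim that $g(s) = s_2/2$ and $h(t) = t/2$ satisfy the conditions required to be
$E(X|Y)(s)$ and $E(X|Y = t)$, respectively. First, note that the $\sigma$-field generated
by $Y$ is $\mathcal{A}_Y = \{\mathbb{R} \times C : C$ is Borel measurable$\}$, and $\mu_Y$ is the measure with
density $\exp(-t^2/2)/\sqrt{2\pi}$. It is clear that any measurable function of $s_2$ alone is
$\mathcal{A}_Y$ measurable. Let $B = \mathbb{R} \times C$, so that $E(X I_B)$ equals

$$\int_C \int_{-\infty}^\infty s_1 \frac{1}{\sqrt{3}\,\pi} \exp\left\{-\frac{2}{3}\left[s_1^2 - s_1 s_2 + s_2^2\right]\right\} ds_1\,ds_2$$
$$= \int_C \int_{-\infty}^\infty s_1 \sqrt{\frac{2}{3\pi}} \exp\left\{-\frac{2}{3}\left(s_1 - \frac{1}{2}s_2\right)^2\right\} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}s_2^2\right\} ds_1\,ds_2$$
$$= \int_C \frac{s_2}{2} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}s_2^2\right\} ds_2 = \int_B g(s)\,d\mu(s).$$

Note also that the third line in the above string equals $\int_C h(s_2)\,d\mu_Y(s_2)$.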

It is easy to see that if $X$ is already measurable with respect to $\mathcal{C}$, then
$E(X|\mathcal{C}) = X$.

Conditional probability turns out to be the special case of conditional
expectation in which $X = I_A$. That is, we define $\Pr(A|\mathcal{C}) = E(I_A|\mathcal{C})$. A conditional
probability is regular if $\Pr(\cdot|\mathcal{C})(s)$ is a probability measure for all $s$. It turns out
that, under very general conditions (see Theorem B.32), we can choose the
functions $\Pr(A|\mathcal{C})(\cdot)$ in such a way that they are regular conditional probabilities. In
particular, the space $(\mathcal{X}, \mathcal{B})$ needs to be sufficiently like the real numbers with the
Borel $\sigma$-field. Such spaces are called Borel spaces as defined in Definition B.31. All
of the most common spaces are Borel spaces. In particular, $\mathbb{R}^k$ is a Borel space
for all finite $k$. For those readers with more mathematical background, complete separable
metric spaces are also Borel spaces. Also, finite and countable products of Borel
spaces are Borel spaces.

In the future, we will assume that all versions of conditional probabilities are
regular when they are on Borel spaces. If $\mathcal{C}$ is the $\sigma$-field generated by $Y$, then
$\Pr(A|Y = y)$ will be used to stand for $E(I_A|Y = y)$.

If $X : S \to \mathcal{X}$ is a random quantity, its conditional distribution is the collection
of conditional probabilities on $\mathcal{X}$ induced from the restriction of conditional
probabilities on $S$ to the $\sigma$-field generated by $X$. If the $\Pr(\cdot|\mathcal{C})$ are regular conditional
probabilities, then we say that the version of the conditional distribution of $X$
given $\mathcal{C}$ is a regular conditional distribution. When we refer to a conditional
distribution without the word "version," we will mean a version of the conditional
distribution. Occasionally, we will need to choose a version that satisfies some
other condition. In those cases, we will try to be explicit about versions.

Because conditional distributions are probability measures, many of the
theorems from Appendix A which apply to such measures apply to conditional
distributions. For example, the monotone convergence theorem A.52 and the
dominated convergence theorem A.57 apply to conditional means because limits of
measurable functions are still measurable. Also, most of the properties of
probability measures from this appendix apply as well.

We now turn our attention to the existence and calculation of densities for 
conditional distributions. If the joint distribution of two random quantities has 
a density with respect to a product measure, then the conditional distributions 
have densities that can be calculated in the usual way, as the joint density divided 
by the marginal density of the conditioning variable. Theorem B.46 allows us to 
extend this result to joint distributions that are not absolutely continuous with 
respect to product measures, such as when one of the quantities is a function of 
the other. Here, we merely give an example of how such conditional densities are 
calculated. 

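Example B.5. Let $X = (X_1, X_2)$ have bivariate normal distribution with density

$$f_X(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right)$$

with respect to Lebesgue measure on $\mathbb{R}^2$. The marginal density of $Y = X_1 + X_2$
with respect to Lebesgue measure is

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left[y - (\mu_1 + \mu_2)\right]^2\right),$$

where $\sigma^2 = \sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2$. The pair $(X, Y)$ does not have a joint density with
respect to Lebesgue measure on $\mathbb{R}^3$, but it does have a joint density with respect
to the measure $\nu$ on $\mathbb{R}^3$ defined as follows. For each $A \subseteq \mathbb{R}^3$, let $A^* = \{(x_1, x_2) :
(x_1, x_2, x_1 + x_2) \in A\}$. Let $\nu(A) = \lambda_2(A^*)$, where $\lambda_k$ is Lebesgue measure on $\mathbb{R}^k$
for $k = 1, 2$. Then $f_{X,Y}(x, y) = f_X(x)$ is the joint density of $(X, Y)$ with respect
to $\nu$, and

$$\frac{f_X(x)}{f_Y(y)} = \frac{\sigma}{\sqrt{2\pi}\,\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{\sigma^2}{2\sigma_1^2\sigma_2^2(1-\rho^2)}\left(x_1 - \mu_1 - \frac{[\sigma_1^2 + \rho\sigma_1\sigma_2](y - \mu_1 - \mu_2)}{\sigma^2}\right)^2\right),$$

if $y = x_1 + x_2$, is the conditional density of $X$ given $Y = y$ with respect to the
measure $\nu_{X|Y}(A|y) = \lambda_1(A_y^*)$, where $A_y^* = \{x_1 : (x_1, y - x_1) \in A\}$.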
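
The ratio of joint to marginal density in this example can be checked numerically. The sketch below evaluates $f_X(x_1, y - x_1)/f_Y(y)$ and compares it with a one-dimensional normal density in $x_1$. The comparison formula (conditional mean $\mu_1 + (\sigma_1^2 + \rho\sigma_1\sigma_2)(y - \mu_1 - \mu_2)/\sigma^2$ and conditional variance $\sigma_1^2\sigma_2^2(1-\rho^2)/\sigma^2$) is the usual normal regression formula, stated here as an assumption; all function and parameter names are illustrative.

```python
import math

def f_X(x1, x2, m1, m2, s1, s2, rho):
    """Bivariate normal density of X = (X1, X2)."""
    q = ((x1 - m1) ** 2 / s1 ** 2
         - 2 * rho * (x1 - m1) * (x2 - m2) / (s1 * s2)
         + (x2 - m2) ** 2 / s2 ** 2)
    norm = 2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2)
    return math.exp(-q / (2 * (1 - rho ** 2))) / norm

def normal_pdf(z, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def conditional_density(x1, y, m1, m2, s1, s2, rho):
    """Joint density at (x1, y - x1) divided by the marginal density of Y = X1 + X2."""
    var_y = s1 ** 2 + s2 ** 2 + 2 * rho * s1 * s2
    return f_X(x1, y - x1, m1, m2, s1, s2, rho) / normal_pdf(y, m1 + m2, var_y)

def closed_form(x1, y, m1, m2, s1, s2, rho):
    """Normal density in x1 with the regression mean and variance (assumed formula)."""
    var_y = s1 ** 2 + s2 ** 2 + 2 * rho * s1 * s2
    mean = m1 + (s1 ** 2 + rho * s1 * s2) * (y - m1 - m2) / var_y
    var = s1 ** 2 * s2 ** 2 * (1 - rho ** 2) / var_y
    return normal_pdf(x1, mean, var)
```

For instance, `conditional_density(0.3, 1.2, 0.5, -1.0, 1.5, 2.0, 0.4)` and `closed_form` with the same arguments should agree up to floating-point rounding.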

The concept of conditional independence will turn out to be central to the
development of statistical models. A collection $\{X_n\}_{n=1}^\infty$ of random quantities is
conditionally independent given another quantity $Y$ if the conditional
distribution (given $Y$) of every finite subset is a product measure. If, in addition, $Y$ is
constant almost surely, we say that $\{X_n\}_{n=1}^\infty$ are independent. We will call
random quantities (conditionally) IID if they are (conditionally) independent and
they all have the same conditional distribution.



B.1.3 Limit Theorems 

There are three types of convergence which we consider for sequences of random
quantities: almost sure convergence, convergence in probability, and convergence
in distribution. The weakest of these is the last. (See Theorem B.90.) A sequence
$\{X_n\}_{n=1}^\infty$ converges in distribution to $X$ if $\lim_{n\to\infty} E(f(X_n)) = E(f(X))$ for
every bounded continuous function $f$. We denote this type of convergence $X_n \xrightarrow{D}
X$. If $\mathcal{X} = \mathbb{R}$, a more common way to express $X_n \xrightarrow{D} X$ is that $\lim_{n\to\infty} F_n(x) =
F(x)$ for all $x$ at which $F$ is continuous, where $F_n$ is the CDF of $X_n$ and $F$ is the
CDF of $X$.$^1$

If $\mathcal{X}$ is a metric space with metric $d$, we say that a sequence $\{X_n\}_{n=1}^\infty$ converges
in probability to $X$ if, for every $\epsilon > 0$, $\lim_{n\to\infty} \Pr(d(X_n, X) > \epsilon) = 0$. We write
this as $X_n \xrightarrow{P} X$. Almost sure convergence is the same as almost everywhere
convergence of functions, and it is the strongest of the three. That is, $X_n \to X$,
a.s. means that $\{s : X_n(s)$ does not converge to $X(s)\} \subseteq E$ with $\Pr(E) = 0$.

A popular method for proving convergence in distribution involves the use of
characteristic functions. The characteristic function of a random vector $X$ is the
complex-valued function

$$\phi_X(t) = E(\exp[it^\top X]).$$

It is easy to see that the characteristic function exists for every random vector
and has complex absolute value at most 1 for all $t$. Other facts that follow directly
from the definition are the following. If $Y = aX + b$, then $\phi_Y(t) = \phi_X(at)\exp(itb)$.
If $X$ and $Y$ are independent, $\phi_{X+Y} = \phi_X\phi_Y$.

The importance of characteristic functions is that they characterize
distributions (see the uniqueness theorem B.106) and they are "continuous" as a function
of the distribution in the sense of convergence in distribution (see the continuity
theorem).

Two of the more useful limit theorems are the weak law of large numbers B.95
and the central limit theorem B.97. If $\{X_n\}_{n=1}^\infty$ are IID random variables with
finite mean $\mu$, then the weak law of large numbers says that the sample average
$\bar{X}_n = \sum_{i=1}^n X_i/n$ converges in probability to $\mu$. If, in addition, they have finite
variance $\sigma^2$, the central limit theorem B.97 says that $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$,
the normal distribution with mean 0 and variance $\sigma^2$.
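
Both limit theorems can be illustrated by simulation. The sketch below is only a numerical illustration, not a proof: it draws IID Uniform(0, 1) variables (an arbitrary choice with mean 1/2 and variance 1/12), and the seed, the sample sizes, and the helper name `sample_mean` are all assumptions made for the example.

```python
import random
import statistics

def sample_mean(n, rng):
    """Average of n IID Uniform(0, 1) draws (mean 1/2, variance 1/12)."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(0)   # fixed seed so the run is reproducible
mu, sigma2 = 0.5, 1.0 / 12.0

# Weak law of large numbers: the sample average is close to mu for large n.
big_mean = sample_mean(10_000, rng)

# Central limit theorem: sqrt(n) * (sample mean - mu) is approximately N(0, sigma2),
# so its empirical mean should be near 0 and its empirical variance near sigma2.
n, reps = 50, 2_000
scaled = [n ** 0.5 * (sample_mean(n, rng) - mu) for _ in range(reps)]
scaled_mean = statistics.fmean(scaled)
scaled_var = statistics.variance(scaled)
```

A histogram of `scaled` would show the familiar bell shape; only its first two moments are computed here to keep the example free of plotting dependencies.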

1 See Problem 25 on page 664. If $\mathcal{X} = \mathbb{R}^k$, the same idea can be used. That
is, $X_n \xrightarrow{D} X$ if and only if the joint CDFs $F_n$ of $X_n$ converge to the joint CDF
$F$ of $X$ at all points at which $F$ is continuous. Since we will not need to use this
characterization, we will not prove it.






B.2 Mathematical Probability 

In this chapter, we will present the basic framework of the measure theoretic
probability calculus. Most of the concepts like random quantities, distributions,
and so forth, will be special cases of measure theoretic concepts introduced in
Appendix A.

B.2.1 Random Quantities and Distributions 

We begin by introducing the basic building blocks of probability theory. 

Definition B.6. A probability space is a measure space $(S, \mathcal{A}, \mu)$ with $\mu(S) = 1$.
Each element of $\mathcal{A}$ is called an event. If $(S, \mathcal{A}, \mu)$ is a probability space, $(\mathcal{X}, \mathcal{B})$
is a measurable space, and $X : S \to \mathcal{X}$ is measurable, then $X$ is called a random
quantity. If $\mathcal{X} = \mathbb{R}$ and $\mathcal{B}$ is the Borel or Lebesgue $\sigma$-field, then $X$ is called a
random variable. Let $\mu_X$ be the probability measure induced on $(\mathcal{X}, \mathcal{B})$ by $X$
from $\mu$ (see Definition A.83). This probability measure is called the distribution
of $X$. The distribution of $X$ is said to be discrete if there exists a countable set
$A \subseteq \mathcal{X}$ such that $\mu_X(A) = 1$. The distribution of $X$ is continuous if $\mu_X(\{x\}) = 0$
for all $x \in \mathcal{X}$.

The distribution of $X$ is easily seen to be equivalent to the restriction of $\mu$ to
the $\sigma$-field generated by $X$, $\mathcal{A}_X$.

When there is one probability space from which all other probabilities are
induced by way of random quantities, then the probability in that one space will
be denoted Pr. So, for example, in the above definition of the distribution of a
random quantity $X$, if $B \in \mathcal{B}$, then $\Pr(X \in B) = \mu(X^{-1}(B)) = \mu_X(B)$.

The distribution of a random variable can be described by its cumulative dis- 
tribution function. 

Definition B.7. A function F is a (cumulative) distribution function (CDF) if 
it has the following properties: 

• F is nondecreasing; 

• $\lim_{x\to-\infty} F(x) = 0$;

• $\lim_{x\to\infty} F(x) = 1$;

• F is continuous from the right. 

Proposition B.8. If $X$ is a random variable, then the function $F_X(x) = \Pr(X \leq
x)$ is a CDF. In this case, $F_X$ is called the CDF of $X$.

A distribution function $F$ can be used to create a measure on $(\mathbb{R}, \mathcal{B}^1)$ as
follows. Set $\mu((a, b]) = F(b) - F(a)$, and extend this to the whole $\sigma$-field using the
Caratheodory extension theorem A.22.$^2$

We can also construct a distribution function from a probability measure on
the real numbers. If $\mu$ is a probability measure on $(\mathbb{R}, \mathcal{B}^1)$, the CDF associated
with it is $F(x) = \mu((-\infty, x])$. If $f$ is a Borel measurable function from $\mathbb{R}$ to $\mathbb{R}$,
we will write $\int f(x)\,dF(x)$ and $\int f(x)\,d\mu(x)$ interchangeably.
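
The passage from a CDF to interval measures and back can be sketched in a few lines. The exponential CDF used below is an arbitrary illustrative choice, and the names `F` and `mu` are assumptions for the example; the extension to all Borel sets is of course not computed, only the values on half-open intervals.

```python
import math

def F(x):
    """CDF of the exponential distribution with rate 1 (an illustrative choice)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def mu(a, b):
    """Measure of the half-open interval (a, b] determined by F."""
    return F(b) - F(a)

# Finite additivity over the disjoint pieces (0, 1], (1, 2.5], (2.5, 4] of (0, 4].
pieces = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0)]
total = sum(mu(a, b) for a, b in pieces)

# Recover the CDF from the measure: F(x) = mu((-inf, x]), with -inf approximated
# by a far-left endpoint where F is already 0.
recovered = mu(-1e9, 3.0)
```

Because the sum over adjacent intervals telescopes, `total` agrees with `mu(0.0, 4.0)` up to floating-point rounding.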



2 See the discussion on page 581 and Problem 7 on page 603.






If $\mu$ is a probability measure on $(\mathbb{R}^n, \mathcal{B}^n)$, a joint CDF can be defined as

$$F(x_1, \ldots, x_n) = \mu((-\infty, x_1] \times \cdots \times (-\infty, x_n]),$$

the measure of an orthant. For every joint CDF, there is a random vector $X$ with
that CDF and we call the CDF $F_X$.

Definition B.9. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let $(\mathcal{X}, \mathcal{B}, \nu)$ be a
measure space. Suppose that $X : S \to \mathcal{X}$ is measurable. Let $\mu_X$ be the measure
induced on $(\mathcal{X}, \mathcal{B})$ by $X$ from $\mu$. Suppose that $\mu_X \ll \nu$. Then we call the
Radon-Nikodym derivative $f_X = d\mu_X/d\nu$ the density of $X$ with respect to $\nu$.

Proposition B.10. // h : X JR is measurable and fx is the density of X 
with respect to v, then j h(x)dF x (x) = J h(x)f x (x)du(x). 

Definition B. 11. If X is a random variable with CDF Fx(-), then the expected 
value (or mean, or expectation) of X is E(X) = fxdF x (x). If X is a random 
vector, then E(X) will stand for the vector with coordinates equal to the means 
of the coordinates of X. 

The following theorem is often called the law of the unconscious statistician, 
because some people forget that it is not really the definition of expected value. 

Theorem B.12. 3 // X : S - X is a random quantity and f : X - 2R is a 
measurable function, then E[f(X)] = / f(x)dfjL X (x), where \i x is the distribution 
of X. 

Proof. If we let Y = f(X), then Y induces a measure (with CDF F_Y) on
(IR, ℬ^1) according to Theorem A.81. The definition of E(Y) is ∫ y dF_Y(y), and
Theorem A.81 says that ∫ y dF_Y(y) = ∫ f(x) dF_X(x). □
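
As a small numerical illustration of Theorem B.12, the following Python sketch computes E[f(X)] in two ways for a discrete X: by integrating f against the distribution of X, and by first constructing the induced distribution of Y = f(X) and applying Definition B.11. The five-point distribution and the choice f(x) = x^2 are hypothetical and serve only as an example.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical distribution mu_X of X on {-2, -1, 0, 1, 2} (illustration only).
mu_X = {
    -2: Fraction(1, 10),
    -1: Fraction(2, 10),
    0: Fraction(4, 10),
    1: Fraction(2, 10),
    2: Fraction(1, 10),
}


def f(x):
    """The measurable function applied to X; here f(x) = x^2."""
    return x * x


# Theorem B.12: integrate f against the distribution of X.
lotus = sum(f(x) * p for x, p in mu_X.items())

# Definition B.11 applied to Y = f(X): build the distribution induced by f first.
mu_Y = defaultdict(Fraction)
for x, p in mu_X.items():
    mu_Y[f(x)] += p
by_definition = sum(y * p for y, p in mu_Y.items())

print(lotus, by_definition)
```

Both routes give the same number, which is the content of the theorem in this finite setting.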

Definition B.13. If X is a random variable with finite mean c, then the variance
of X is the mean of (X - c)^2 and is denoted Var(X). If X is a random vector
with finite mean vector c, then the covariance matrix of X is the mean of
(X - c)(X - c)^T and is also denoted Var(X). The covariance of two random variables
X and Y with finite means c_X and c_Y is E([X - c_X][Y - c_Y]) and is denoted
Cov(X, Y).

It is possible for a random variable to have finite mean and infinite variance. 
Proposition B.14. If X has finite mean μ, then Var(X) = E(X^2) - μ^2.
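
A short derivation of Proposition B.14, assuming E(X^2) is finite so that linearity of expectation applies (if E(X^2) is infinite, both sides are infinite), followed by one standard example of finite mean with infinite variance. The density in the example is our choice of illustration and does not come from the text.

```latex
\begin{align*}
\operatorname{Var}(X) &= E\bigl[(X-\mu)^2\bigr]
   = E(X^2) - 2\mu\,E(X) + \mu^2 \\
 &= E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2 .
\end{align*}
% Example: the density f(x) = 2/x^3 on [1, \infty) has
\begin{align*}
E(X) = \int_1^\infty x \cdot \frac{2}{x^3}\,dx = 2, \qquad
E(X^2) = \int_1^\infty x^2 \cdot \frac{2}{x^3}\,dx = \infty .
\end{align*}
```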

B.2.2 Some Useful Inequalities 

Although there are theoretical formulas for calculating means of functions of 
random variables, often they are not analytically tractable. We may, on the other 
hand, only need to know that a mean is less than some value. For this reason we 
present some well-known inequalities concerning means of random variables. 

3 This theorem is used in making sense of the notation E_θ when introducing
parametric models.






Theorem B.15 (Markov inequality). 4 Suppose that X is a nonnegative random
variable with finite mean μ. Then, for all c > 0, Pr(X ≥ c) ≤ μ/c.

Proof. Let F be the CDF of X. Then, we can write 

\[ \mu = \int x \, dF(x) \ge \int_{[c,\infty)} x \, dF(x) \ge c \int_{[c,\infty)} dF(x) = c \Pr(X \ge c). \]

Divide the extreme parts by c to get the result. □ 
The following well-known inequality follows trivially from the Markov inequal- 
ity B.15. 

Corollary B.16 (Tchebychev's inequality). 5 Suppose that X is a random
variable with finite variance σ^2 and finite mean μ. Then, for all c > 0,

\[ \Pr(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}. \]
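
To see how conservative these two bounds can be, the following Python sketch compares them with exact probabilities for one hypothetical example, an Exponential(1) random variable, whose mean and variance both equal 1. The choice of distribution and of the values of c is ours.

```python
import math


def tail_exact(c):
    """Pr(X >= c) for X ~ Exponential(1) and c > 0."""
    return math.exp(-c)


def two_sided_exact(c):
    """Pr(|X - 1| >= c) for X ~ Exponential(1), whose mean is 1."""
    upper = math.exp(-(1.0 + c))  # Pr(X >= 1 + c)
    lower = 1.0 - math.exp(-(1.0 - c)) if c < 1.0 else 0.0  # Pr(X <= 1 - c)
    return upper + lower


mean, variance = 1.0, 1.0
rows = []
for c in (0.5, 1.0, 2.0, 4.0):
    rows.append((c, tail_exact(c), mean / c, two_sided_exact(c), variance / c**2))

for c, markov_exact, markov_bound, cheb_exact, cheb_bound in rows:
    print(f"c={c}: Pr(X>=c)={markov_exact:.4f} <= {markov_bound:.4f}; "
          f"Pr(|X-1|>=c)={cheb_exact:.4f} <= {cheb_bound:.4f}")
```

In this example the bounds hold at every c but are far from tight, which is typical: they use only the first one or two moments.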

Another well-known inequality involves convex functions. 6 The proof of this 
theorem resembles the proofs in Ferguson (1967) and Berger (1985). 

Theorem B.17 (Jensen's inequality). 7 Let g be a convex function defined on
a convex subset 𝒳 of IR^k and suppose that Pr(X ∈ 𝒳) = 1. If E(X) is finite,
then E(X) ∈ 𝒳 and g(E(X)) ≤ E(g(X)).

PROOF. First, we prove that E(X) ∈ 𝒳 by induction on the dimension of 𝒳.
Without loss of generality, we can assume that E(X) = 0, since we can subtract
E(X) from X and from every element of 𝒳, and E(X) ∈ 𝒳 if and only if 0 ∈
𝒳 - E(X). If k = 0, then 𝒳 = {0} and E(X) = 0. Suppose that 0 ∈ 𝒳 for all
𝒳 with dimension strictly less than m ≤ k. Now suppose that X and 𝒳 have
dimension m and 0 ∉ 𝒳. Since 𝒳 and {0} are disjoint convex sets, the separating
hyperplane theorem C.5 says that there is a nonzero vector v and a constant c
such that, for every x ∈ 𝒳, v^T x ≤ c and 0 ≥ c. 8 If we let Y = v^T X, then we
have Pr(Y ≤ c) = 1 and E(Y) = 0 ≥ c. It follows that Pr(Y = c) = 1 and c = 0.
Hence, X lies in the (m - 1)-dimensional convex set Z = 𝒳 ∩ {x : v^T x = 0}. It
follows that 0 ∈ Z ⊆ 𝒳.

Next, we prove the inequality by induction on k. For k = 0, E(g(X)) =
g(E(X)), since X is degenerate. Suppose that the inequality holds for all
dimensions up to m - 1 < k. Let X have dimension m. Define the subset of IR^{m+1},

\[ \mathcal{X}' = \{(x, z) : x \in \mathcal{X},\ z \in \mathbb{R},\ \text{and } g(x) \le z\}. \]

Let (x_1, z_1) and (x_2, z_2) be in 𝒳' and define

\[ (y, w) = (\alpha x_1 + (1 - \alpha) x_2,\ \alpha z_1 + (1 - \alpha) z_2). \]



4 This theorem is used in the proofs of Corollary B.16 and Lemma 1.61. 
5 This corollary is used in the proof of Theorem 1.59. 

6 Let 𝒳 be a linear space. A function f : 𝒳 → IR is convex if f(λx + (1 - λ)y) ≤
λf(x) + (1 - λ)f(y) for all x, y ∈ 𝒳 and all λ ∈ [0, 1].

7 This theorem is used in the proofs of Lemma B.114 and Theorems B.118 
and 3.20. 

8 The symbol v^T stands for the transpose of the vector v.






Since αg(x_1) + (1 - α)g(x_2) ≥ g(y) and w ≥ αg(x_1) + (1 - α)g(x_2), it follows that
(y, w) ∈ 𝒳', so 𝒳' is convex. It is also clear that (E(X), g(E(X))) is a boundary
point of 𝒳'. The supporting hyperplane theorem C.4 says that there is a vector
v = (v_x, v_z) such that, for all (x, z) ∈ 𝒳', v_x^T x + v_z z ≥ v_x^T E(X) + v_z g(E(X)).
Since (x, z_1) ∈ 𝒳' implies (x, z_2) ∈ 𝒳' for all z_2 > z_1, it cannot be that v_z < 0,
since then lim_{z→∞} v_x^T x + v_z z = -∞, a contradiction. Since (x, g(x)) ∈ 𝒳' for all
x ∈ 𝒳, it follows that v_x^T X + v_z g(X) ≥ v_x^T E(X) + v_z g(E(X)), from which we
conclude

\[ v_z\, g(E(X)) \le v_x^\top [X - E(X)] + v_z\, g(X). \tag{B.18} \]

Taking expectations of both sides of this gives v_z g(E(X)) ≤ v_z E(g(X)). If v_z > 0, the
proof is complete. If v_z = 0, then (B.18) becomes 0 ≤ v_x^T [X - E(X)], which implies
v_x^T [X - E(X)] = 0 with probability 1. Hence X lies in an (m - 1)-dimensional
space, and the induction hypothesis finishes the proof. □
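
A quick numerical check of Jensen's inequality in two dimensions may help fix ideas. In the Python sketch below, the four-point distribution and the convex function g(x) = max(|x_1|, |x_2|) + x_1^2 are hypothetical choices of ours; exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction

# Hypothetical four-point distribution on IR^2 (illustration only).
points = [(0, 0), (1, 2), (-3, 1), (2, -2)]
probs = [Fraction(1, 4)] * 4


def g(x):
    """A convex function on IR^2: the sum of max(|x1|, |x2|) and x1^2."""
    return max(abs(x[0]), abs(x[1])) + x[0] ** 2


# E(X), computed coordinate by coordinate as in Definition B.11.
mean = tuple(sum(p * x[i] for x, p in zip(points, probs)) for i in range(2))

g_of_mean = g(mean)                                       # g(E(X))
mean_of_g = sum(p * g(x) for x, p in zip(points, probs))  # E(g(X))

print(mean, g_of_mean, mean_of_g)
```

Here g(E(X)) = 1/4 while E(g(X)) = 21/4, consistent with g(E(X)) ≤ E(g(X)).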
The famous Cauchy-Schwarz inequality for vectors 9 has a probabilistic version. 

Theorem B.19 (Cauchy-Schwarz inequality). 10 Let X_1 and X_2 be two random
vectors of the same dimension such that E(||X_i||^2) is finite for i = 1, 2. Then

\[ E(|X_1^\top X_2|) \le \sqrt{E\|X_1\|^2 \, E\|X_2\|^2}. \tag{B.20} \]

Proof. Let Z = 1 if X_1^T X_2 ≥ 0 and Z = -1 if X_1^T X_2 < 0. Let Y = ||X_1 +
cZX_2||^2, where c = -√(E||X_1||^2 / E||X_2||^2). Then Y ≥ 0 and Z^2 = 1. So

\[ 0 \le E(Y) = E\|X_1\|^2 + c^2 E\|X_2\|^2 + 2c\,E(|X_1^\top X_2|). \]

The desired result follows immediately from this inequality. □
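
For readers who want the last step of the Cauchy-Schwarz proof written out: put a = E||X_1||^2 and b = E||X_2||^2, and assume a > 0 and b > 0 (if either is zero, the corresponding vector is 0 almost surely and the inequality is trivial; this case split is our addition). With c = -\sqrt{a/b} we have c^2 b = a, so the inequality 0 ≤ E(Y) becomes:

```latex
\begin{align*}
0 &\le a + c^2 b + 2c\,E(|X_1^\top X_2|)
   = 2a - 2\sqrt{a/b}\;E(|X_1^\top X_2|), \\
E(|X_1^\top X_2|) &\le a\sqrt{b/a} = \sqrt{ab}
   = \sqrt{E\|X_1\|^2 \, E\|X_2\|^2}.
\end{align*}
```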



B.3 Conditioning 

B.3.1 Conditional Expectations 

Section B.1.2 contains a heuristic derivation of the important concepts in condi- 
tioning using the special case of discrete random quantities. We now turn to a 
more general presentation. 

Theorem B.21. 11 Let (S, 𝒜, μ) be a probability space, and suppose that X : S →
IR is a measurable function with E(|X|) < ∞. Let 𝒞 be a sub-σ-field of 𝒜. Then
there exists a 𝒞 measurable function g : S → IR which satisfies

\[ E(X I_B) = \int_B g(s)\, d\mu(s), \quad \text{for all } B \in \mathcal{C}. \]



9 That is, if x_1 and x_2 are vectors, then |x_1^T x_2| ≤ ||x_1|| ||x_2||.
10 This theorem is used in the proofs of Theorems 3.44, 5.13, and 5.18. 
11 This theorem is used to help define the general concept of conditional 
expectation. 






Proof. Use Theorem A.54 to construct two measures μ_+ and μ_- on (S, 𝒞):

\[ \mu_+(B) = \int_B X^+(s)\, d\mu(s), \qquad \mu_-(B) = \int_B X^-(s)\, d\mu(s). \]

It is clear that μ_+ ≪ μ and μ_- ≪ μ. The Radon-Nikodym theorem A.74 tells us
that there are 𝒞 measurable functions g_+ and g_- such that

\[ \mu_+(B) = \int_B g_+(s)\, d\mu(s), \qquad \mu_-(B) = \int_B g_-(s)\, d\mu(s). \]

Since E(X I_B) = μ_+(B) - μ_-(B), the result follows with g = g_+ - g_-. □

We will use the symbol E(X|𝒞) to stand for the function g. If 𝒞 is the σ-field
generated by another random quantity Y, then the symbol E(X|Y) is usually
used instead of E(X|𝒞). For the case in which 𝒞 is the σ-field generated by Y,
the next corollary follows from Theorem B.21 with the help of Theorem A.42.

Corollary B.22. Let (S, 𝒜, μ) be a probability space, and let (𝒴, 𝒞) be a
measurable space such that 𝒞 contains all singletons. Suppose that X : S → IR and
Y : S → 𝒴 are measurable functions and E(|X|) < ∞. Let μ_Y be the measure
induced on (𝒴, 𝒞) by Y from μ (see Theorem A.81). Let 𝒜_Y be the sub-σ-field
of 𝒜 generated by Y. Then there exists a function h : 𝒴 → IR that satisfies the
following: If B ∈ 𝒜_Y equals Y^{-1}(C) for C ∈ 𝒞, then E(X I_B) = ∫_C h(t) dμ_Y(t).

We will use the symbol E(X|Y = t) to stand for the h(t) in Corollary B.22.
At this point the reader might wish to review Example B.4 on page 609. 
To summarize the above results, we state the following. 

Definition B.23. Let (S, 𝒜, μ) be a probability space, and suppose that X :
S → IR is measurable and E(|X|) < ∞. Let 𝒞 be a sub-σ-field of 𝒜. We define
the conditional mean (conditional expectation) of X given 𝒞, denoted E(X|𝒞), to
be any 𝒞 measurable function g : S → IR that satisfies

\[ E(X I_B) = \int_B g(s)\, d\mu(s), \quad \text{for all } B \in \mathcal{C}. \]

Each such function is called a version of the conditional mean. If Y : S → 𝒴 and
𝒞 is the sub-σ-field generated by Y, then E(X|𝒞) is also called the conditional
mean of X given Y, denoted E(X|Y). If, in addition, the σ-field of subsets of 𝒴
contains singletons, let h : 𝒴 → IR be the function such that g = h(Y). Then
h(t) is denoted by E(X|Y = t).
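
On a finite probability space the defining property of the conditional mean can be verified by enumeration. The Python sketch below uses a hypothetical space S = {0, ..., 5} with uniform μ, X(s) = s^2, and Y(s) = s mod 3 (all our choices). It computes h(t) = E(X|Y = t) as a ratio, sets g = h(Y), and checks that E(X I_B) equals the integral of g over B for every B in the σ-field generated by Y.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space: S = {0,...,5} with uniform mu.
mu = {s: Fraction(1, 6) for s in range(6)}


def X(s):
    return s * s


def Y(s):
    return s % 3


def cond_mean_given_Y(t):
    """h(t) = E(X I{Y = t}) / Pr(Y = t), defined where Pr(Y = t) > 0."""
    num = sum(X(s) * p for s, p in mu.items() if Y(s) == t)
    den = sum(p for s, p in mu.items() if Y(s) == t)
    return num / den


h = {t: cond_mean_given_Y(t) for t in {Y(s) for s in mu}}


def g(s):
    """The version g = h(Y) of E(X | Y), as a function on S."""
    return h[Y(s)]


def check_defining_property():
    """Check E(X I_B) = integral of g over B for every B generated by Y."""
    values = sorted(h)
    for r in range(len(values) + 1):
        for C in combinations(values, r):
            B = [s for s in mu if Y(s) in C]  # B = Y^{-1}(C)
            lhs = sum(X(s) * mu[s] for s in B)
            rhs = sum(g(s) * mu[s] for s in B)
            if lhs != rhs:
                return False
    return True


print(h, check_defining_property())
```

Because every set in the σ-field generated by Y is a union of level sets of Y, checking the eight sets Y^{-1}(C) covers all of 𝒞 in this example.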

When we say that a random variable equals E(X|Y), we will mean that it is
a version of E(X|Y). The following propositions are immediate consequences of
the above definitions. 

Proposition B.24. Let (S, 𝒜, μ) be a probability space, and let (𝒴, 𝒞) be a
measurable space such that 𝒞 contains singletons. Let X : S → IR and Y : S → 𝒴
be measurable. Let μ_Y be the measure on 𝒴 induced from μ by Y. A function
g : 𝒴 → IR is a version of E(X|Y = t) if and only if for all B ∈ 𝒞,
∫_B g(t) dμ_Y(t) = E(X I_B(Y)).






Proposition B.25. 

• If Z and W are both versions of E(X|𝒞), then Z = W, a.s.

• If X is 𝒞 measurable, then E(X|𝒞) = X, a.s.

Proposition B.26. If 𝒞 = {S, ∅}, the trivial σ-field, then E(X|𝒞) = E(X).

Proposition B.27. 12 Let (S, 𝒜, μ) be a probability space, and let (𝒴, 𝒞) be a
measurable space. Let X : S → IR and Y : S → 𝒴 be measurable, and let
g : 𝒴 → IR be such that g(Y)X is integrable. Let μ_Y be the measure on 𝒴 induced
from μ by Y. Then E(g(Y)X) = ∫ g(t) E(X|Y = t) dμ_Y(t).

Proposition B.28. 13 Let (S, 𝒜, μ) be a probability space and let X : S → IR,
Y : S → (𝒴, ℬ_1), and Z : S → (𝒵, ℬ_2) be measurable functions. Let μ_Y and μ_Z
be the measures induced on 𝒴 and 𝒵 by Y and Z, respectively, from μ. Suppose
that E(|X|) < ∞ and that Z is a one-to-one function of Y, that is, there exists
a bimeasurable h : 𝒴 → 𝒵 such that Z = h(Y). Then E(X|Y = y) = E(X|Z =
h(y)), a.s. [μ_Y].

Conditional probability is the special case of conditional expectation in which
X = I_A.

Definition B.29. Let (S, 𝒜, μ) be a probability space. For each A ∈ 𝒜, the
conditional probability of A given 𝒞 (or given Y if 𝒞 is the σ-field generated by
Y) is Pr(A|𝒞) = E(I_A|𝒞). If Pr(·|𝒞)(s) is a probability on (S, 𝒜) for all s ∈ S, then
the conditional probabilities given 𝒞 are called regular conditional probabilities.

It turns out that under very general conditions (see Theorem B.32), we can
choose the functions Pr(A|𝒞) in such a way that they are regular conditional
probabilities. In the future, we will assume that this is done in all such cases.
If 𝒞 is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for
E(I_A|Y = y) as in the discussion following Corollary B.22.

If X : S → 𝒳 is a random quantity, its conditional distribution is the collection
of conditional probabilities on 𝒳 induced from the restriction of conditional
probabilities on S to the σ-field generated by X.

Definition B.30. Let (S, 𝒜, μ) be a probability space and let (𝒳, ℬ) be a
measurable space. Suppose that X : S → 𝒳 is a measurable function. Let P be the
probability on (𝒳, ℬ) induced by X from μ. Let 𝒞 be a sub-σ-field of 𝒜. For
each B ∈ ℬ, let P(B|𝒞) = Pr(A|𝒞), where A = X^{-1}(B). We say that any set of
functions from S to [0, 1] of the form

    {P(B|𝒞)(·), for all B ∈ ℬ}

is a version of the conditional distribution of X given 𝒞. If 𝒞 is the σ-field
generated by another random quantity Y : S → 𝒴, a version of the conditional



12 This proposition is used in the proof of Theorem B.64. 

13 This proposition is used to facilitate the transition from spaces of probability 
measures to subsets of Euclidean space when parametric models are introduced. 
It is also used in the proof of Theorem 2.114. 






distribution of X given Y is specified by any collection of probability functions 
of the form 

    {Pr(·|Y = t), for all t ∈ 𝒴}.

If the P(·|𝒞) are regular conditional probabilities, then we say that the version
of the conditional distribution of X given 𝒞 is a regular conditional distribution.

When we refer to a conditional distribution without the word "version," we 
will mean a version of the conditional distribution. Occasionally, we will need to 
choose a version that satisfies some other condition. In those cases, we will try 
to be explicit about versions. 

If X is sufficiently like the real numbers, there will be versions of conditional 
distributions that are regular. We make that precise with the following definition. 

Definition B.31. Let (𝒳, ℬ) be a measurable space. If there exists a bimeasurable
function φ : 𝒳 → R, where R is a Borel subset of IR, then (𝒳, ℬ) is called a
Borel space.

In particular, we can show that all Euclidean spaces with the Borel σ-fields
are Borel spaces. (See Lemma B.36.) First, we prove that regular conditional 
distributions exist on Borel spaces. The proof is borrowed from Breiman (1968, 
Section 4.3). 

Theorem B.32. Let (S, 𝒜, μ) be a probability space and let 𝒞 be a sub-σ-field of
𝒜. Let (𝒳, ℬ) be a Borel space. Let X : S → 𝒳 be a random quantity. Then there
exists a regular conditional distribution of X given 𝒞.

Proof. Let φ : 𝒳 → R be the function guaranteed by Definition B.31. Define
the random variable Z = φ(X) : S → R ⊆ IR. First we prove that the σ-field
generated by X, 𝒜_X, is contained in the σ-field generated by Z, 𝒜_Z. Let
B ∈ 𝒜_X; then there is C ∈ ℬ such that B = X^{-1}(C). Since φ is one-to-one,
φ^{-1}(φ(C)) = C. Since φ^{-1} is measurable, φ(C) is a Borel subset of R. Now,
Z^{-1}(φ(C)) = X^{-1}(C) = B, hence B ∈ 𝒜_Z. It is also easy to see that 𝒜_Z is
contained in 𝒜_X, so they are equal. If Z has a regular conditional distribution,
then so does X. The remainder of the proof is to show that Z has a regular
conditional distribution.

For each rational number q, choose a version of Pr(Z ≤ q|𝒞) and let

\[ M_{q,r} = \{s : \Pr(Z \le q \mid \mathcal{C})(s) < \Pr(Z \le r \mid \mathcal{C})(s)\}, \qquad M = \bigcup_{q > r} M_{q,r}. \]

According to Problem 3 on page 662 and countable additivity, μ(M) = 0. Next,
define

\[ N_q = \Bigl\{s : \lim_{r \downarrow q,\ r\ \text{rational}} \Pr(Z \le r \mid \mathcal{C})(s) \ne \Pr(Z \le q \mid \mathcal{C})(s)\Bigr\}, \qquad N = \bigcup_{\text{all } q} N_q. \]

We can use Problem 3 on page 662 once again to prove that μ(N_q) = 0 for all q,
hence μ(N) = 0. Similarly, we can show that μ(L) = 0, where L is the set

\[ \Bigl\{s : \lim_{r \to -\infty,\ r\ \text{rational}} \Pr(Z \le r \mid \mathcal{C})(s) \ne 0\Bigr\} \cup \Bigl\{s : \lim_{r \to \infty,\ r\ \text{rational}} \Pr(Z \le r \mid \mathcal{C})(s) \ne 1\Bigr\}. \]






If G is an arbitrary CDF, we can define

\[ F(z \mid \mathcal{C})(s) = \begin{cases} G(z) & \text{if } s \in M \cup N \cup L, \\ \lim_{r \downarrow z,\ r\ \text{rational}} \Pr(Z \le r \mid \mathcal{C})(s) & \text{otherwise.} \end{cases} \]

F(·|𝒞)(s) is a CDF for every s (see Problem 2 on page 661), and it is easy to
check that F(z|𝒞) is a version of Pr(Z ≤ z|𝒞) for every z. If we extend F(·|𝒞)(s)
to a probability measure η(·; s) on the Borel σ-field for every s, we only need to
check that, for every Borel set B, η(B; ·) is a version of Pr(Z ∈ B|𝒞). That is, for
every C ∈ 𝒞, we need

\[ \int_C \eta(B; s)\, d\mu(s) = \Pr(\{Z \in B\} \cap C). \tag{B.33} \]

By construction, (B.33) is true if B is an interval of the form (-∞, z]. Such
intervals form a π-system Π such that ℬ is the smallest σ-field containing Π. If
we define

\[ Q_1(B) = \frac{\int_C \eta(B; s)\, d\mu(s)}{\Pr(C)}, \qquad Q_2(B) = \frac{\Pr(\{Z \in B\} \cap C)}{\Pr(C)}, \]

we see that Q_1 and Q_2 agree on Π. Tonelli's theorem A.69 can be used to see
that Q_1 is countably additive, while Q_2 is clearly a probability. It follows from
Theorem A.26 that Q_1 and Q_2 agree on ℬ. □
Note that the only condition required for regular conditional distributions to 
exist is a condition on the space of the random quantity for which we desire a 
regular conditional distribution. The σ-field 𝒞, or the random quantity on which
we condition, can be quite general. In the future, if we assume that (𝒳, ℬ) is a
Borel space, we can construct regular conditional distributions given anything we
wish. Also, since the function in the definition of Borel space is one-to-one and
the Borel σ-field of IR contains singletons, it follows that the σ-field of a Borel
space contains singletons (cf. Theorem A.42).



B.3.2 Borel Spaces* 

In this section we prove that there are lots of Borel spaces. First, we prove that 
every space satisfying some general conditions is a Borel space, and then we will 
show that Euclidean spaces satisfy those conditions. Then, we show that finite 
and countable products of Borel spaces are Borel spaces. The most general type 
of Borel space in which we shall be interested is a complete separable metric 
space (sometimes called a Polish space). 

Definition B.34. Let 𝒳 be a topological space. A subset D of 𝒳 is dense if, for
every x ∈ 𝒳 and every open set U containing x, there is an element of D in U.
If there exists a countable dense subset of 𝒳, then 𝒳 is separable. Suppose that
𝒳 is a metric space with metric d. A sequence {x_n}_{n=1}^∞ is Cauchy if, for every ε,
there exists N such that m, n ≥ N implies d(x_n, x_m) < ε. A metric space 𝒳 is
complete if every Cauchy sequence converges. A complete and separable metric
space is called a Polish space.



*This section may be skipped without interrupting the flow of ideas. 






We would like to prove that all Polish spaces are Borel spaces. First, we prove
that IR^∞ is a Borel space (Lemma B.36). Then we prove that there exist bimeasurable
maps between Polish spaces and measurable subsets of IR^∞ (Lemma B.40).
The following simple proposition pieces these results together.

Proposition B.35. If 𝒳 is a Borel space and there exists a bimeasurable function
f : 𝒴 → 𝒳, then 𝒴 is a Borel space.

Lemma B.36. The infinite product space IR^∞ is a Borel space.

Proof. The idea of the proof 14 is the following. We start by transforming each
coordinate to the interval (0, 1) using a continuous function with continuous
inverse. For each number in (0, 1) we find a base 2 expansion, which is a sequence
of 0s and 1s. We then take these sequences (one for each coordinate) and merge
them into a single sequence, which we then interpret as the base 2 expansion of
a number in (0, 1). If this sequence of transformations is bimeasurable, we have
our function φ.

Let ψ : IR^∞ → (0, 1)^∞ be defined by

\[ \psi(x_1, x_2, \ldots) = \left( \frac{1}{2} + \frac{\tan^{-1}(x_1)}{\pi},\ \frac{1}{2} + \frac{\tan^{-1}(x_2)}{\pi},\ \ldots \right), \]

which is bimeasurable. For each x ∈ [0, 1), set y_0(x) = x and for j = 1, 2, ...,
define

\[ z_j(x) = \begin{cases} 1 & \text{if } 2y_{j-1}(x) \ge 1, \\ 0 & \text{if not,} \end{cases} \qquad y_j(x) = 2y_{j-1}(x) - z_j(x). \]

For each j, z_j is a measurable function. It is easy to see that z_j(x) is the jth digit
in a base 2 expansion of x with infinitely many 0s. Note also that y_j(x) ∈ [0, 1)
for all j and x.
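
The digit recursion can be run mechanically. The Python sketch below takes z_j(x) = 1 exactly when 2y_{j-1}(x) ≥ 1 and returns the first n digits; the function name binary_digits and the use of exact rational arithmetic are our choices for illustration, not part of the text.

```python
from fractions import Fraction


def binary_digits(x, n):
    """Return [z_1(x), ..., z_n(x)] for x in [0, 1), using the recursion
    z_j = 1 if 2*y_{j-1} >= 1 else 0 and y_j = 2*y_{j-1} - z_j, with y_0 = x."""
    if not 0 <= x < 1:
        raise ValueError("x must lie in [0, 1)")
    y = Fraction(x)
    digits = []
    for _ in range(n):
        z = 1 if 2 * y >= 1 else 0
        y = 2 * y - z  # y stays in [0, 1)
        digits.append(z)
    return digits


print(binary_digits(Fraction(5, 8), 5))  # 5/8 = 0.101 in base 2
print(binary_digits(Fraction(1, 2), 4))
print(binary_digits(Fraction(1, 3), 6))
```

For a dyadic rational such as 1/2 the recursion yields 1, 0, 0, ..., that is, the expansion ending in infinitely many 0s rather than the one ending in infinitely many 1s.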

Create the following triangular array of integers: 

1 

2 3 

4 5 6 

7 8 9 10 

11 12 13 14 15 



Let the jth integer from the top of the ith column be ℓ(i, j). Then

\[ \ell(i, j) = \frac{i(i+1)}{2} + i(j - 1) + \frac{(j-1)(j-2)}{2}. \]



14 This proof is adapted from Breiman (1968, Theorem A.47). 






Clearly, each integer t appears once and only once as ℓ(i, j) for some i and j. 15
Define

\[ h(x_1, x_2, \ldots) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \frac{z_j(x_i)}{2^{\ell(i,j)}}. \tag{B.37} \]

Then h is clearly a measurable function from (0, 1)^∞ to a subset R of (0, 1). There
is a countable subset of (0, 1) which is not in the image of h. These are the numbers
with only finitely many 0s in one or more of the subsequences {z_{ℓ(i,j)}}_{j=1}^∞ of their
base 2 expansion for i = 1, 2, .... For example, the number c = Σ_{t=0}^∞ 2^{-t(t+1)/2-1}
is not in R. 16 Since the complement of a countable set is measurable, the set R
is measurable.

We define φ = h(ψ). If we can show that h has a measurable inverse, the proof
is complete. For each x ∈ R, define

\[ \phi_i(x) = \sum_{j=1}^{\infty} \frac{z_{\ell(i,j)}(x)}{2^{j}}. \tag{B.38} \]

Clearly, each φ_i is measurable. Note that, for each i and j,

\[ z_j(\phi_i(x)) = z_{\ell(i,j)}(x). \tag{B.39} \]

Combining (B.37), (B.38), (B.39), and the fact that every integer appears once
and only once as ℓ(i, j) for some i and j, we see that h(φ_1(x), φ_2(x), ...) = x, so
that (φ_1, φ_2, ...) is the inverse of h and it is measurable. □
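
The digit-merging construction in this proof can be tried out on a truncated version. The Python sketch below is an illustration under stated assumptions: it handles only finitely many coordinates and finitely many digits per coordinate (all remaining digits are treated as 0), and it takes ℓ(i, j) = i(i+1)/2 + i(j-1) + (j-1)(j-2)/2, which reproduces the triangular array with columns indexed by i. The names ell, merge, and split are ours.

```python
from fractions import Fraction


def ell(i, j):
    """Position of the j-th entry from the top of column i in the triangular array."""
    return i * (i + 1) // 2 + i * (j - 1) + (j - 1) * (j - 2) // 2


def binary_digits(x, n):
    """First n base 2 digits of x in [0, 1), via the recursion in the proof."""
    y = Fraction(x)
    digits = []
    for _ in range(n):
        z = 1 if 2 * y >= 1 else 0
        y = 2 * y - z
        digits.append(z)
    return digits


def merge(xs, n_digits):
    """Truncated h: digit j of coordinate i is placed at position ell(i, j)."""
    total = Fraction(0)
    for i, x in enumerate(xs, start=1):
        for j, z in enumerate(binary_digits(x, n_digits), start=1):
            total += Fraction(z, 2 ** ell(i, j))
    return total


def split(x, n_coords, n_digits):
    """Truncated inverse: digit ell(i, j) of x becomes digit j of coordinate i."""
    digits = binary_digits(x, ell(n_coords, n_digits))  # largest position needed
    return [
        sum(Fraction(digits[ell(i, j) - 1], 2 ** j) for j in range(1, n_digits + 1))
        for i in range(1, n_coords + 1)
    ]


xs = [Fraction(5, 8), Fraction(1, 4), Fraction(3, 4)]  # hypothetical coordinates
merged = merge(xs, 3)
print(merged, split(merged, 3, 3))
```

With three-digit dyadic inputs the round trip is exact, which mirrors the identity h(φ_1(x), φ_2(x), ...) = x in the proof. The truncation itself is only a device for computing; the actual map acts on infinite sequences.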

Lemma B.40. If (𝒳, ℬ) is a Polish space with the Borel σ-field and metric d,
then it is a Borel space. 17

PROOF. All we need to prove is that there exists a bimeasurable f : 𝒳 → G, where
G is a measurable subset of IR^∞. We then use Lemma B.36 and Proposition B.35.

Let {x_n}_{n=1}^∞ be a countable dense subset of 𝒳, and let d be the metric on 𝒳.
Define the function f : 𝒳 → IR^∞ by

\[ f(x) = (d(x, x_1), d(x, x_2), \ldots). \]

We will first show that f is continuous, which will make it measurable. Suppose
that {y_n}_{n=1}^∞ is a sequence in 𝒳 that converges to y ∈ 𝒳. The kth coordinate of
f(y_n) is d(y_n, x_k), which converges to d(y, x_k) because the metric is continuous.
Hence, each coordinate of f is continuous, and f is continuous. Next, we prove
that f is one-to-one. Suppose that f(x) = f(y). Then d(x, x_n) = d(y, x_n) for



15 It is easy to check the following. For each integer t, let k = inf{n : t ≤
n(n + 1)/2}. Then s(t) = 1 + k(k + 1)/2 - t and r(t) = k + 1 - s(t) have the
property that ℓ(r(t), s(t)) = t, r(ℓ(i, j)) = i, and s(ℓ(i, j)) = j.

16 This number corresponds to having 1s in the first column of the triangular
array but nowhere else. Clearly, 0 < c < 1, but it is impossible to have 1s in the
entire first column, since this would require x_1 = 1. Even if x_1 = 1 had been
allowed, its base 2 expansion would have ended in infinitely many 0s rather than
infinitely many 1s.

17 This proof is adapted from p. 219 of Billingsley (1968) and Theorem 15.8 of 
Royden (1968). 






all n. Since {x_n}_{n=1}^∞ is dense, there exists a subsequence {x_{n_j}}_{j=1}^∞ such that
lim_{j→∞} x_{n_j} = x. It follows that 0 = lim_{j→∞} d(x, x_{n_j}) = lim_{j→∞} d(y, x_{n_j}); hence
lim_{j→∞} x_{n_j} = y, and y = x.

Next, we prove that f^{-1} : f(𝒳) → 𝒳 is continuous. Suppose that a sequence
of points {f(y_n)}_{n=1}^∞ converges to f(y). Let lim_{j→∞} x_{n_j} = y. Then
lim_{j→∞} d(y, x_{n_j}) = 0. But d(y, x_{n_j}) is the n_j coordinate of f(y), which in turn
is the limit (as n → ∞) of the n_j coordinate of f(y_n). For each j, d(y_n, y) ≤
d(y_n, x_{n_j}) + d(y, x_{n_j}). Let ε > 0 and let j be large enough so that d(y, x_{n_j}) < ε/2.
Now, let N be large enough so that n ≥ N implies d(y_n, x_{n_j}) < d(y, x_{n_j}) + ε/2. It
follows that, if n ≥ N, d(y_n, y) < ε. Hence lim_{n→∞} y_n = y and f^{-1} is continuous,
hence measurable.

Finally, we will prove that the image G of f is a measurable subset of IR^∞. We
will do this by proving that G is the intersection of countably many open subsets
of Ḡ. 18 Let G_n be the following set:

    {x ∈ IR^∞ : ∃ O_x, a neighborhood of x, with d(a, b) < 1/n for all a, b ∈ f^{-1}(O_x)}.

Since O_x ⊆ G_n for each x ∈ G_n, G_n is open. Also, since f and f^{-1} are continuous,
it is easy to see that G ⊆ G_n for all n. Let G' = Ḡ ∩ ⋂_{n=1}^∞ G_n. For each x ∈ G',
let O_{x,n} ⊆ G_n be such that O_{x,1} ⊇ O_{x,2} ⊇ ··· and that d(a, b) < 1/n for
all a, b ∈ f^{-1}(O_{x,n}). Note that f^{-1}(O_{x,n}) ⊇ f^{-1}(O_{x,n+1}) for all n. If y_n ∈
f^{-1}(O_{x,n}) for every n, then {y_n}_{n=1}^∞ is a Cauchy sequence, since n, m ≥ N
implies d(y_n, y_m) < 1/N. Hence, there is a limit y to the sequence. It is easy to
see that if there were two such sequences with limits y and y', then d(y, y') < ε
for all ε > 0, hence y = y'. So we can define a function h : G' → 𝒳 by h(x) = y.
If x ∈ G, then clearly h(x) = f^{-1}(x). If x' ∈ O_{x,n}, then d(h(x), h(x')) < 1/n, so
h is continuous. We now prove that G' ⊆ G, which implies that G = G' and the
proof will be complete. Let x ∈ G', and let x_n ∈ G be such that x_n → x. (This is
possible since G' ⊆ Ḡ.) Since h is continuous, f^{-1}(x_n) → h(x). If y_n = f^{-1}(x_n)
and y = h(x), then y_n → y and f(y_n) → f(y) ∈ G, since f is continuous. But
f(y_n) = x_n, so f(y) = x, and the proof is complete. □
Next, we show that products of Borel spaces are Borel spaces. 

Lemma B.41. Let (𝒳_n, ℬ_n) be a Borel space for each n. The product spaces
∏_{i=1}^n 𝒳_i for all finite n and ∏_{n=1}^∞ 𝒳_n with product σ-fields are Borel spaces.

PROOF. We will prove the result for the infinite product. The proofs for finite
products are similar. If 𝒳_n = IR for all n, the result is true by Lemma B.36. For
general 𝒳_n, let φ_n : 𝒳_n → R_n and φ_* : IR^∞ → R_* be bimeasurable, where R_n
and R_* are measurable subsets of IR. Then, it is easy to see that

\[ \phi : \prod_{n=1}^{\infty} \mathcal{X}_n \to \phi_*\Bigl( \prod_{n=1}^{\infty} R_n \Bigr) \]

is bimeasurable, where φ(x_1, x_2, ...) = φ_*(φ_1(x_1), φ_2(x_2), ...). □
Next, we show that the set of bounded continuous functions from [0, 1] to the 
real numbers is also a Polish space. 



18 We use the symbol Ḡ to stand for the closure of the set G. The closure of a subset
G of a topological space is the smallest closed set containing G. A set is closed if 
and only if its complement is open. 






Lemma B.42. 19 Let C[0, 1] be the set of all bounded continuous functions from
[0, 1] to IR. Let ρ(f, g) = sup_{x∈[0,1]} |f(x) - g(x)|. Then, ρ is a metric on C[0, 1]
and C[0, 1] is a Polish space.

Proof. That ρ is a metric is easy to see. To see that C[0, 1] is separable, let D_k be the
set of functions that take on rational values at the points 0, 1/k, ..., (k - 1)/k, 1
and are linear between these values. Let D = ∪_{k=1}^∞ D_k. The set D is countable.
Every continuous function on a compact set is uniformly continuous, so let f ∈
C[0, 1] and ε > 0. Let δ be small enough so that |x - y| < δ implies |f(x) - f(y)| <
ε/4. Let k be larger than 4/ε. There exists g ∈ D_k such that |g(i/k) - f(i/k)| < ε/4
for each i = 0, ..., k. For i/k ≤ x ≤ (i + 1)/k, |f(x) - f(i/k)| < ε/4, and
|g(x) - g(i/k)| < ε/2, so |f(x) - g(x)| < ε. To see that C[0, 1] is complete, let
{f_n}_{n=1}^∞ be a Cauchy sequence. Then, for all x, {f_n(x)}_{n=1}^∞ is a Cauchy sequence
of real numbers that converges to some number f(x). We need to show that the
convergence of f_n to f is uniform. To the contrary, assume that there exists ε such
that, for each n there is x_n such that |f_n(x_n) - f(x_n)| > ε. We know that there
exists n such that m ≥ n implies |f_n(x) - f_m(x)| < ε/2 for all x. In particular,
|f_n(x_n) - f_m(x_n)| < ε/2 for all m ≥ n. Since lim_{m→∞} f_m(x_n) = f(x_n), it follows
that there exists m such that |f_m(x_n) - f(x_n)| < ε/2, a contradiction. □
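
The separability argument is constructive, and a small experiment shows it at work. The Python sketch below builds a piecewise linear function with rational values at the nodes i/k and measures its distance from a target function. The target f(x) = x^2, the values k = 50 and denominator 1000, and the function names are hypothetical choices; the supremum in ρ is only approximated by a maximum over a fine grid.

```python
from fractions import Fraction


def rational_piecewise_linear(f, k, denom):
    """Return g: rational node values (multiples of 1/denom) at i/k, linear in between."""
    nodes = [Fraction(round(f(i / k) * denom), denom) for i in range(k + 1)]

    def g(x):
        i = min(int(x * k), k - 1)  # subinterval [i/k, (i+1)/k] containing x
        t = x * k - i               # relative position inside that subinterval
        return float(nodes[i]) * (1 - t) + float(nodes[i + 1]) * t

    return g


def sup_distance(f, g, grid_size=10_000):
    """Grid approximation to rho(f, g) = sup over [0, 1] of |f(x) - g(x)|."""
    return max(abs(f(j / grid_size) - g(j / grid_size)) for j in range(grid_size + 1))


def square(x):
    return x * x


g = rational_piecewise_linear(square, k=50, denom=1000)
err = sup_distance(square, g)
print(err)
```

Increasing k and the denominator drives the distance toward 0, which is what membership of such functions in a countable dense set D requires.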
Because Borel spaces have σ-fields that look just like the Borel σ-field of the
real numbers, their σ-fields are generated by countably many sets. The countable
field that generates the Borel σ-field of IR is the collection of all sets that are
unions of finitely many disjoint intervals (including degenerate ones and infinite
ones) with rational endpoints.

Proposition B.43. 20 Let (𝒳, ℬ) be a Borel space. Then there exists a countable
field 𝒞 such that ℬ is the smallest σ-field containing 𝒞.

Because a field is a π-system, Theorem A.26 and Proposition B.43 imply the
following. 

Corollary B.44. Let (𝒳, ℬ) be a Borel space, and let 𝒞 be a countable field that
generates ℬ. If μ_1 and μ_2 are σ-finite measures on ℬ that agree on 𝒞, then they
agree on ℬ.



B.3.3 Conditional Densities 

Because conditional distributions are probability measures, many of the theorems 
from Appendix A which apply to such measures apply to conditional distribu- 
tions. For example, the monotone convergence theorem A.52 and the dominated 
convergence theorem A.57 apply to conditional means because limits of measur- 
able functions are still measurable. Also, most of the properties of probability 
measures from this appendix apply as well. In this section, we focus on the exis- 
tence and calculation of densities for conditional distributions. 

If the joint distribution of two random quantities has a density with respect 
to a product measure, then the conditional distributions have densities that can 



19 This lemma is used in the proof of Lemma 2.121.

20 This proposition is used in the proofs of Lemmas 2.124 and 2.126 and
Theorem 3.110.






be calculated in the usual way. 

Proposition B.45. Let (S, 𝒜, μ) be a probability space and let (𝒳, ℬ_1, ν_𝒳) and
(𝒴, ℬ_2, ν_𝒴) be σ-finite measure spaces. Let X : S → 𝒳 and Y : S → 𝒴 be
measurable functions. Let μ_{X,Y} be the probability induced on (𝒳 × 𝒴, ℬ_1 ⊗ ℬ_2)
by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν_𝒳 × ν_𝒴. Let the density be f_{X,Y}(x, y).
Let the probability induced on (𝒴, ℬ_2) by Y from μ be denoted μ_Y. Then μ_Y is
absolutely continuous with respect to ν_𝒴 with density

\[ f_Y(y) = \int f_{X,Y}(x, y)\, d\nu_{\mathcal{X}}(x), \]

and the conditional distribution of X given Y has densities

\[ f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \]

with respect to ν_𝒳.

This proposition can be proven directly using Tonelli's theorem A. 69 or as a 
special case of Theorem B.46 (see Problem 15 on page 663). 
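
In the simplest setting, where both dominating measures are counting measures on finite sets, the two formulas of Proposition B.45 reduce to summing and dividing entries of a table. The Python sketch below uses a hypothetical 2 × 2 joint density (our choice) to compute the marginal density of Y and the conditional densities of X given Y = y.

```python
from fractions import Fraction

# Hypothetical joint density f_{X,Y}(x, y) with respect to counting x counting measure.
joint = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
}


def marginal_y(joint):
    """f_Y(y): integrate (here, sum) the joint density over x."""
    f_y = {}
    for (x, y), p in joint.items():
        f_y[y] = f_y.get(y, Fraction(0)) + p
    return f_y


def conditional_x_given_y(joint):
    """f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y), defined where f_Y(y) > 0."""
    f_y = marginal_y(joint)
    return {(x, y): p / f_y[y] for (x, y), p in joint.items() if f_y[y] > 0}


f_y = marginal_y(joint)
cond = conditional_x_given_y(joint)
print(f_y, cond)
```

For each fixed y the conditional densities sum to 1 over x, as they must for a density with respect to counting measure.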

Theorem B.46. Let (𝒳, ℬ_1) be a Borel space, let (𝒴, ℬ_2) be a measurable space,
and let (𝒳 × 𝒴, ℬ_1 ⊗ ℬ_2, ν) be a σ-finite measure space. Then, there exists a measure
ν_𝒴 on (𝒴, ℬ_2) and for each y ∈ 𝒴, there exists a measure ν_{𝒳|𝒴}(·|y) on (𝒳, ℬ_1)
such that for each integrable or nonnegative h : 𝒳 × 𝒴 → IR, ∫ h(x, y) dν_{𝒳|𝒴}(x|y)
is ℬ_2 measurable and

\[ \int h(x, y)\, d\nu(x, y) = \int \left[ \int h(x, y)\, d\nu_{\mathcal{X}|\mathcal{Y}}(x \mid y) \right] d\nu_{\mathcal{Y}}(y). \tag{B.47} \]

Proof. Let f be the strictly positive integrable function guaranteed by
Theorem A.85. Without loss of generality, assume that ∫ f(x, y) dν(x, y) = 1. The
measure μ(A) = ∫_A f(x, y) dν(x, y) is a probability, ν ≪ μ, and (dν/dμ)(x, y) =
1/f(x, y). Let μ_{𝒳|𝒴} be a regular conditional distribution on (𝒳, ℬ_1) constructed
from μ, and let ν_𝒴 be the marginal distribution on (𝒴, ℬ_2). Define

\[ \nu_{\mathcal{X}|\mathcal{Y}}(A \mid y) = \int_A \frac{1}{f(x, y)}\, d\mu_{\mathcal{X}|\mathcal{Y}}(x \mid y). \]

Note that

\[ \int I_{A \times B}(x, y)\, d\mu_{\mathcal{X}|\mathcal{Y}}(x \mid y) = I_B(y)\, \mu_{\mathcal{X}|\mathcal{Y}}(A \mid y), \tag{B.48} \]

which is a measurable function of y because μ_{𝒳|𝒴} is a regular conditional
distribution. Just as in the proof of Lemma A.61, we can use the π-λ theorem A.17
to show that ∫ g dμ_{𝒳|𝒴} is measurable if g is the indicator of an element of
the product σ-field. It follows that ∫ g dμ_{𝒳|𝒴} is measurable for every nonnegative
simple function g. By the monotone convergence theorem A.52, letting
{g_n}_{n=1}^∞ be nonnegative simple functions increasing to g everywhere, it follows
that ∫ g(x, y) dμ_{𝒳|𝒴}(x|y) is measurable for all nonnegative measurable functions,
and hence ∫ h dν_{𝒳|𝒴} = ∫ h/f dμ_{𝒳|𝒴} is measurable if h is nonnegative.






Next, define a probability η on (𝒳 × 𝒴, ℬ_1 ⊗ ℬ_2) by

\[ \eta(C) = \int \left[ \int I_C(x, y)\, d\mu_{\mathcal{X}|\mathcal{Y}}(x \mid y) \right] d\nu_{\mathcal{Y}}(y). \]

It follows from (B.48) that η and μ agree on the collection of all product sets
(a π-system that generates ℬ_1 ⊗ ℬ_2). Theorem A.26 implies that they agree on
ℬ_1 ⊗ ℬ_2. By linearity of integrals and the monotone convergence theorem A.52,
if g is nonnegative, then

\begin{align*}
\int g(x, y)\, d\eta(x, y) &= \int \left[ \int g(x, y)\, d\mu_{\mathcal{X}|\mathcal{Y}}(x \mid y) \right] d\nu_{\mathcal{Y}}(y) \\
 &= \int \left[ \int g(x, y) f(x, y)\, d\nu_{\mathcal{X}|\mathcal{Y}}(x \mid y) \right] d\nu_{\mathcal{Y}}(y). \tag{B.49}
\end{align*}

For every nonnegative h,

\begin{align*}
\int h(x, y)\, d\nu(x, y) &= \int \frac{h(x, y)}{f(x, y)} f(x, y)\, d\nu(x, y) = \int \frac{h(x, y)}{f(x, y)}\, d\mu(x, y) \tag{B.50} \\
 &= \int \frac{h(x, y)}{f(x, y)}\, d\eta(x, y) = \int \left[ \int h(x, y)\, d\nu_{\mathcal{X}|\mathcal{Y}}(x \mid y) \right] d\nu_{\mathcal{Y}}(y),
\end{align*}

where the second equality follows from the fact that dμ/dν = f, the third follows
from the fact that μ and η are the same measure, and the fourth follows
from (B.49). If h is integrable with respect to ν, then (B.50) applies to h^+,
h^-, and |h|, and all three results are finite. Also, ∫ |h(x, y)| dν_{𝒳|𝒴}(x|y) is
measurable and ν_𝒴({y : ∫ |h(x, y)| dν_{𝒳|𝒴}(x|y) = ∞}) = 0. So ∫ h^+(x, y) dν_{𝒳|𝒴}(x|y)
and ∫ h^-(x, y) dν_{𝒳|𝒴}(x|y) are both finite almost surely, and their difference is
∫ h(x, y) dν_{𝒳|𝒴}(x|y), a measurable function. It now follows that (B.47) holds. □
The measures ν_𝒴 and ν_{𝒳|𝒴} in Theorem B.46 are not unique. In the proof, we
could easily have defined ν_𝒴 several ways, such as ν_𝒴(A) = ∫_A g(y) dμ_𝒴(y) for any
strictly positive function g with finite μ_𝒴 integral. A corresponding adjustment
would have to be made to the definition of ν_{𝒳|𝒴}.

In the special case in which ν is a product measure ν_1 × ν_2, it is easy to show
that ν_1 can play the role of ν_{𝒳|𝒴}(·|y) for all y and that ν_2 can play the role of ν_𝒴
in Theorem B.46. (See Problem 15 on page 663.)

There is a familiar application of Theorem B.46 to cases in which 𝒳 and 𝒴 are
Euclidean spaces but ν is concentrated on a lower-dimensional manifold defined
by a function y = g(x).

Proposition B.51. Suppose that 𝒳 = IR^n and 𝒴 = IR^k, with k < n. Let g :
𝒳 → 𝒴 be such that there exists h : 𝒳 → IR^{n-k} such that v(x) = (g(x), h(x))
is one-to-one, is differentiable, and has a differentiable inverse. For y ∈ IR^k and
w ∈ IR^{n-k}, define J(y, w) to be the Jacobian, that is, the determinant of the
matrix of partial derivatives of the coordinates of v^{-1}(y, w) with respect to the
coordinates of y and of w. Let λ_i be Lebesgue measure on IR^i, for each i. Define
a measure ν on 𝒳 × 𝒴 by ν(C) = λ_n({x : (x, g(x)) ∈ C}). Then, ν_𝒴 equal to
Lebesgue measure on IR^k and ν_{𝒳|𝒴}(A|y) = ∫_{A_y} J(y, w) dλ_{n-k}(w) satisfy (B.47),
where A_y = {w : v^{-1}(y, w) ∈ A}.

We are now in position to derive a formula for conditional densities in general. 21 

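Theorem B.52. Let $(S, \mathcal{A}, \mu)$ be a probability space, let $(\mathcal{X}, \mathcal{B}_1)$ be a Borel space,
let $(\mathcal{Y}, \mathcal{B}_2)$ be a measurable space, and let $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_1 \otimes \mathcal{B}_2, \nu)$ be a $\sigma$-finite measure
space. Let $\nu_Y$ and $\nu_{X|Y}$ be as guaranteed by Theorem B.46. Let $X : S \to \mathcal{X}$ and
$Y : S \to \mathcal{Y}$ be measurable functions. Let $\mu_{X,Y}$ be the probability induced on
$(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_1 \otimes \mathcal{B}_2)$ by $(X, Y)$ from $\mu$. Suppose that $\mu_{X,Y} \ll \nu$. Let the density be
$f_{X,Y}(x,y)$. Let the probability induced on $(\mathcal{Y}, \mathcal{B}_2)$ by $Y$ from $\mu$ be denoted $\mu_Y$.
Then, $\mu_Y \ll \nu_Y$; for each $y \in \mathcal{Y}$,

$$\frac{d\mu_Y}{d\nu_Y}(y) = f_Y(y) = \int_{\mathcal{X}} f_{X,Y}(x,y)\,d\nu_{X|Y}(x|y); \tag{B.53}$$

and the conditional distribution of $X$ given $Y = y$ has density

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \tag{B.54}$$

with respect to $\nu_{X|Y}(\cdot|y)$.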

Proof. It follows from Theorem B.46 that for all $B \in \mathcal{B}_2$,

$$\mu_Y(B) = \int I_B(y) f_{X,Y}(x,y)\,d\nu(x,y) = \int I_B(y)\left[\int f_{X,Y}(x,y)\,d\nu_{X|Y}(x|y)\right]d\nu_Y(y).$$

The fact that $\mu_Y \ll \nu_Y$ and (B.53) both follow from this equation. Let $\mu_{X|Y}(\cdot|y)$
denote a regular conditional distribution of $X$ given $Y = y$. For each $A \in \mathcal{B}_1$
and $B \in \mathcal{B}_2$, apply Theorem B.46 with $h(x,y) = I_A(x) I_B(y) f_{X|Y}(x|y) f_Y(y)$ to
conclude

$$\mu_{X,Y}(A \times B) = \int_B \left[\int_A f_{X|Y}(x|y)\,d\nu_{X|Y}(x|y)\right]d\mu_Y(y).$$

Since this is true for all $B \in \mathcal{B}_2$, we conclude that

$$\mu_{X|Y}(A|y) = \int_A f_{X|Y}(x|y)\,d\nu_{X|Y}(x|y).$$

Hence (B.54) gives the density of $\mu_{X|Y}(\cdot|y)$ with respect to $\nu_{X|Y}(\cdot|y)$. □
The point of Theorem B.52 is that we can calculate conditional densities for
random quantities even if the measure that dominates the joint distribution is
not a product measure. When the joint distribution is dominated by a product
measure, the conditional distributions are all dominated by the same measure.
(See Problem 15 on page 663.) In general, however, the conditional distribution
of $X$ given $Y = y$ is dominated by a measure that depends on $y$. For example, if
$Y = g(X)$, the joint distribution of $(X, Y)$ is not dominated by a product measure
even if the distribution of $X$ is dominated. (See also Problem 7 on page 662.)
Nevertheless, we have the following result.

21 The condition that the joint distribution have a density with respect to a
measure $\nu$ in Theorem B.52 is always met since $\nu$ can be taken equal to the joint
distribution. The theorem applies even if $\nu$ is not the joint distribution, however.

Corollary B.55. 22 Let $(S, \mathcal{A}, \mu)$ be a probability space, let $(\mathcal{Y}, \mathcal{B}_2)$ be a measurable
space such that $\mathcal{B}_2$ contains all singletons, and let $(\mathcal{X}, \mathcal{B}_1)$ be a Borel space with
$\nu_X$ a $\sigma$-finite measure on $(\mathcal{X}, \mathcal{B}_1)$. Let $X : S \to \mathcal{X}$ and $g : \mathcal{X} \to \mathcal{Y}$ be measurable
functions. Let $Y = g(X)$. Suppose that the distribution of $X$ has density $f_X$ with
respect to $\nu_X$. Define $\nu$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_1 \otimes \mathcal{B}_2)$ by $\nu(C) = \nu_X(\{x : (x, g(x)) \in C\})$.
Let $\mu_{X,Y}$ be the probability induced on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_1 \otimes \mathcal{B}_2)$ by $(X, Y)$ from $\mu$. Let the
probability induced on $(\mathcal{Y}, \mathcal{B}_2)$ by $Y$ from $\mu$ be denoted $\mu_Y$. Then $\mu_{X,Y} \ll \nu$ with
Radon-Nikodym derivative $f_{X,Y}(x,y) = f_X(x) I_{\{g(x)\}}(y)$. Also, the conditions of
Theorem B.46 hold, and we can write

$$\frac{d\mu_Y}{d\nu_Y}(y) = f_Y(y) = \int_{\mathcal{X}} I_{\{g(x)\}}(y) f_X(x)\,d\nu_{X|Y}(x|y),$$

$$f_{X|Y}(x|y) = \begin{cases} \dfrac{f_X(x)}{f_Y(y)} & \text{if } y = g(x), \\ 0 & \text{otherwise.} \end{cases}$$

Also, the conditional distribution of $Y$ given $X$ is given by $\mu_{Y|X}(C|x) = I_C(g(x))$.

Proof. Since $\nu_X$ is $\sigma$-finite, $\nu$ is also. Since $Y$ is a function of $X$, Theorem A.81
implies that for all integrable $h$, $\int h(x,y)\,d\nu(x,y) = \int h(x, g(x))\,d\nu_X(x)$. The facts
that $f_{X,Y}$ has the specified form and that $\mu_{Y|X}$ is the conditional distribution of
$Y$ given $X$ follow easily from this equation. □
The point of Corollary B.55 is that if $Y = g(X)$, then we can assume that the
conditional distribution of $X$ given $Y = y$ is concentrated on $g^{-1}(\{y\})$.
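The formulas in Corollary B.55 can be checked by hand in a small discrete case. The sketch below is only an illustration under stated assumptions: the probability mass function `f_X`, the map `g(x) = x**2`, and all function names are hypothetical, $\nu_X$ is taken to be counting measure, and $\nu_{X|Y}(\cdot|y)$ is taken to be counting measure on $g^{-1}(\{y\})$.

```python
from fractions import Fraction

# Hypothetical pmf of X with respect to counting measure on {-2, -1, 1, 2}.
f_X = {-2: Fraction(1, 10), -1: Fraction(2, 10),
       1: Fraction(3, 10), 2: Fraction(4, 10)}


def g(x):
    """Hypothetical measurable map; Y = g(X)."""
    return x * x


def f_Y(y):
    """Marginal density of Y: sum f_X over the fiber g^{-1}({y})."""
    return sum(p for x, p in f_X.items() if g(x) == y)


def f_X_given_Y(x, y):
    """Conditional density of X given Y = y, zero off the fiber g^{-1}({y})."""
    if g(x) != y:
        return Fraction(0)
    return f_X[x] / f_Y(y)


# The conditional distribution given Y = 4 lives on {-2, 2} only.
cond_given_4 = {x: f_X_given_Y(x, 4) for x in f_X}
```

Here `cond_given_4` assigns all of its mass to the two points with $x^2 = 4$, which is the concentration on $g^{-1}(\{y\})$ described above.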

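Example B.56. 23 Let $f$ be a spherically symmetric density with respect to $\lambda_n$,
Lebesgue measure on $\mathbb{R}^n$. That is, $f(x) = h(x^\top x)$ for some function $h : \mathbb{R} \to \mathbb{R}^{+0}$
(the interval $[0, \infty)$) and $\int h(x^\top x)\,d\lambda_n(x) = 1$. Let $X$ have density $f$ and let
$V = X^\top X$. Let $R = V^{1/2}$, and transform to spherical coordinates:

$$\begin{aligned}
x_1 &= r\cos(\theta_1), \\
x_2 &= r\sin(\theta_1)\cos(\theta_2), \\
&\;\;\vdots \\
x_{n-1} &= r\sin(\theta_1)\cdots\sin(\theta_{n-2})\cos(\theta_{n-1}), \\
x_n &= r\sin(\theta_1)\cdots\sin(\theta_{n-1}).
\end{aligned}$$

The Jacobian is $r^{n-1} j(\theta)$, where $j$ is some function of $\theta$ alone. The Jacobian for
the transformation to $v$ and $\theta$ is $v^{n/2-1} j(\theta)/2$. The integral of $j(\theta)$ over all $\theta$
values is the surface area of the unit sphere, $2\pi^{n/2}/\Gamma(n/2)$. So, the marginal density of $V$ is

$$f_V(v) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}\right)}\, v^{n/2-1} h(v).$$

The conditional density of $X$ given $V = v$ is then

$$f_{X|V}(x|v) = \frac{h(v)}{f_V(v)}\, I_{\{v\}}(x^\top x) = \frac{\Gamma\left(\frac{n}{2}\right)}{\pi^{n/2} v^{n/2-1}}\, I_{\{v\}}(x^\top x)$$

with respect to the measure $\nu_{X|V}(C|v) = \int_{C^*} v^{n/2-1} j(\theta)\,d\lambda_{n-1}(\theta)/2$, where

$$C^* = \{\theta : v^{1/2}(\cos(\theta_1), \ldots, \sin(\theta_1)\cdots\sin(\theta_{n-1})) \in C\}.$$

It follows that the conditional distribution of $X$ given $V = v$ is given by

$$\mu_{X|V}(C|v) = \frac{\Gamma\left(\frac{n}{2}\right)}{2\pi^{n/2}} \int_{C^*} j(\theta)\,d\lambda_{n-1}(\theta).$$

It is easy to see that $\mu_{X|V}(\cdot|v)$ is the uniform distribution over the sphere of
radius $v^{1/2}$ in $n$ dimensions.

Another example was given in Example B.5 on page 610.

22 This corollary is used in the proof of Theorem 2.86 and in Example 3.106.
23 The calculation in this example is used again in Example 4.121.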




B.3.4 Conditional Independence 

The concept of conditional independence will turn out to be central to the devel- 
opment of statistical models. 

Definition B.57. Let $\aleph$ be an index set, let $Y$ and $\{X_i\}_{i \in \aleph}$ be random quantities,
and let $\mathcal{A}_i$ be the $\sigma$-field generated by $X_i$. We say that $\{X_i\}_{i \in \aleph}$ are conditionally
independent given $Y$ if, for every $n$ and every set of distinct indices $i_1, \ldots, i_n$ and
every collection of sets $A_1 \in \mathcal{A}_{i_1}, \ldots, A_n \in \mathcal{A}_{i_n}$, we have

$$\Pr\left(\bigcap_{j=1}^{n} A_j \,\Big|\, Y\right) = \prod_{j=1}^{n} \Pr(A_j | Y), \quad \text{a.s.} \tag{B.58}$$

If, in addition, $Y$ is constant almost surely, we say $\{X_i\}_{i \in \aleph}$ are independent.

Under the same conditions as above, if all of the conditional distributions of
the $X_i$ given $Y$ are the same, then we say $\{X_i\}_{i \in \aleph}$ are conditionally IID given
$Y$. If, in addition, $Y$ is constant almost surely, we say $\{X_i\}_{i \in \aleph}$ are IID.

Example B.59. Let $F$ be a joint CDF of $n$ random variables $X_1, \ldots, X_n$, and
let $\mu$ be the corresponding measure on $\mathbb{R}^n$. Then $\mu$ is a product measure if and
only if $X_1, \ldots, X_n$ are independent (see Proposition B.66).
Example B.60 (Continuation of Example B.56; see page 627). 24 Transform to
$(Y, V)$, where $Y = X/V^{1/2}$. Then, the conditional distribution of $Y$ given $V$ is
given by

$$\mu_{Y|V}(D|v) = \frac{\Gamma\left(\frac{n}{2}\right)}{2\pi^{n/2}} \int_{D'} j(\theta)\,d\lambda_{n-1}(\theta),$$

where $D' = \{\theta : (\cos(\theta_1), \ldots, \sin(\theta_1)\cdots\sin(\theta_{n-1})) \in D\}$. We note that this
formula does not depend on $v$; hence $Y$ is independent of $V$. In addition, it is
easy to see that $\mu_{Y|V}(\cdot|v)$ is just the uniform distribution over the sphere of
radius 1 in $n$ dimensions.

24 This calculation is used again in Example 4.121.
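The independence of the direction $Y = X/V^{1/2}$ and the squared length $V$ can be probed by simulation. The following is a rough sketch, not a proof: it assumes the particular spherically symmetric density given by the standard normal in $n = 3$ dimensions, and the function names, seed, and sample size are arbitrary choices made for this illustration. Sample correlations near zero are consistent with independence but do not establish it.

```python
import math
import random


def sample_direction_and_v(rng, n=3):
    """Draw X ~ N(0, I_n); return the direction X / |X| and V = X^T X."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v = sum(xi * xi for xi in x)
    r = math.sqrt(v)
    return [xi / r for xi in x], v


def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n
    var_a = sum((ai - mean_a) ** 2 for ai in a) / n
    var_b = sum((bi - mean_b) ** 2 for bi in b) / n
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [sample_direction_and_v(rng) for _ in range(20000)]
y1 = [y[0] for y, _ in draws]
y1_sq = [y[0] ** 2 for y, _ in draws]
v = [vv for _, vv in draws]

corr_y1_v = correlation(y1, v)        # should be close to 0
corr_y1sq_v = correlation(y1_sq, v)   # should be close to 0
mean_y1_sq = sum(y1_sq) / len(y1_sq)  # uniform on the unit sphere gives 1/n
```

For the uniform distribution on the unit sphere in three dimensions, each squared coordinate has mean $1/3$, which `mean_y1_sq` should approximate.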

The use of conditional independence in predictive inference is based on the 
following theorem. 

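Theorem B.61. 25 Let $\aleph$ be an index set, let $Y$ and $\{X_i\}_{i \in \aleph}$ be a collection
of random quantities, and let $\mathcal{A}_i$ be the $\sigma$-field generated by $X_i$. Then $\{X_i\}_{i \in \aleph}$
are conditionally independent given $Y$ if and only if for every $n$ and $m$ and
every set of distinct indices $i_1, \ldots, i_n, j_1, \ldots, j_m$ and every collection of sets
$A_1 \in \mathcal{A}_{i_1}, \ldots, A_n \in \mathcal{A}_{i_n}$, we have

$$\Pr\left(\bigcap_{i=1}^{n} A_i \,\Big|\, Y, X_{j_1}, \ldots, X_{j_m}\right) = \Pr\left(\bigcap_{i=1}^{n} A_i \,\Big|\, Y\right), \quad \text{a.s.} \tag{B.62}$$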



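Proof. For the "if" part, we will assume (B.62) and prove (B.58) by induction
on $n$. For $n = 1$, there is nothing to prove. Assuming (B.58) is true for all $n \le k$,
we now prove it for $n = k + 1$. Let $A_j \in \mathcal{A}_{i_j}$ for $j = 1, \ldots, k + 1$. According to
(B.62) and (B.58) for $n = k$, we have

$$\Pr\left(\bigcap_{i=1}^{k} A_i \,\Big|\, Y, X_{i_{k+1}}\right) = \Pr\left(\bigcap_{i=1}^{k} A_i \,\Big|\, Y\right) = \prod_{i=1}^{k} \Pr(A_i | Y).$$

It follows that for all $B \in \mathcal{A}_Y$, the $\sigma$-field generated by $Y$,

$$\begin{aligned}
\Pr\left(B \cap \bigcap_{i=1}^{k+1} A_i\right) &= \Pr\left(B \cap A_{k+1} \cap \bigcap_{i=1}^{k} A_i\right) = \int_{B \cap A_{k+1}} \Pr\left(\bigcap_{i=1}^{k} A_i \,\Big|\, Y, X_{i_{k+1}}\right)(s)\,d\mu(s) \\
&= \int_{B \cap A_{k+1}} \prod_{i=1}^{k} \Pr(A_i | Y)(s)\,d\mu(s) = \int_B I_{A_{k+1}}(s) \prod_{i=1}^{k} \Pr(A_i | Y)(s)\,d\mu(s) \\
&= \int_B \Pr(A_{k+1} | Y)(s) \prod_{i=1}^{k} \Pr(A_i | Y)(s)\,d\mu(s) = \int_B \prod_{i=1}^{k+1} \Pr(A_i | Y)(s)\,d\mu(s).
\end{aligned}$$

The equality of the first and last terms above for all $B \in \mathcal{A}_Y$ means that
$\prod_{i=1}^{k+1} \Pr(A_i | Y) = \Pr(\bigcap_{i=1}^{k+1} A_i | Y)$, a.s., which is what we need to complete the
induction.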

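For the "only if" part, we will assume (B.58) and prove (B.62). For a function
$g$ to be the left-hand side of (B.62), it must be measurable with respect to the
$\sigma$-field $\mathcal{A}_{Y,m}$ generated by $Y, X_{j_1}, \ldots, X_{j_m}$, and satisfy

$$\int_C g(s)\,d\mu(s) = \Pr\left(C \cap \bigcap_{i=1}^{n} A_i\right) \tag{B.63}$$

for all $C \in \mathcal{A}_{Y,m}$. Clearly, the right-hand side of (B.62) is measurable with respect
to $\mathcal{A}_{Y,m}$. If $C = C_Y \cap C_X$, where $C_Y \in \mathcal{A}_Y$ and $C_X$ is in the $\sigma$-field generated by
$X_{j_1}, \ldots, X_{j_m}$, then

$$\begin{aligned}
\Pr\left(C \cap \bigcap_{i=1}^{n} A_i\right) &= \int_{C_Y} \Pr\left(C_X \cap \bigcap_{i=1}^{n} A_i \,\Big|\, Y\right)(s)\,d\mu(s) = \int_{C_Y} \Pr(C_X | Y)(s) \Pr\left(\bigcap_{i=1}^{n} A_i \,\Big|\, Y\right)(s)\,d\mu(s) \\
&= \int_{C_Y} I_{C_X}(s) \Pr\left(\bigcap_{i=1}^{n} A_i \,\Big|\, Y\right)(s)\,d\mu(s) = \int_C \Pr\left(\bigcap_{i=1}^{n} A_i \,\Big|\, Y\right)(s)\,d\mu(s).
\end{aligned}$$

This means that (B.63) holds with $g = \Pr(\bigcap_{i=1}^{n} A_i | Y)$ so long as $C$ is of the
specified form. To show that it holds for all $C \in \mathcal{A}_{Y,m}$, we first note that $\mathcal{A}_{Y,m}$ is
the smallest $\sigma$-field containing all sets of the specified form. Clearly, (B.63) holds
for all sets that are unions of finitely many disjoint sets of the specified form by
linearity of integrals. These sets form a field $\mathcal{C}$. According to Lemma A.24, for
each $\epsilon > 0$, there is $C_\epsilon \in \mathcal{C}$ such that $\Pr(C_\epsilon \Delta C) < \epsilon/2$. The following facts follow
trivially:

$$\left|\int_C g(s)\,d\mu(s) - \int_{C_\epsilon} g(s)\,d\mu(s)\right| < \frac{\epsilon}{2},$$

$$\int_{C_\epsilon} g(s)\,d\mu(s) = \Pr\left(C_\epsilon \cap \bigcap_{i=1}^{n} A_i\right),$$

$$\left|\Pr\left(C_\epsilon \cap \bigcap_{i=1}^{n} A_i\right) - \Pr\left(C \cap \bigcap_{i=1}^{n} A_i\right)\right| < \frac{\epsilon}{2}.$$

Combining these gives that $|\int_C g(s)\,d\mu(s) - \Pr(C \cap \bigcap_{i=1}^{n} A_i)| < \epsilon$. Since $\epsilon$ is
arbitrary, (B.63) holds for all $C \in \mathcal{A}_{Y,m}$. □

25 This theorem is used in the proofs of Theorems 2.14 and 2.20.

A particular case of interest involves three random quantities. Theorem B.64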
A particular case of interest involves three random quantities. Theorem B.64 
says that when there are only two Xs in Theorem B.61, we can check conditional 
independence by checking only one of the equations of the form (B.62). 

Theorem B.64. 26 Let $X$, $Y$, and $Z$ be three random quantities, and let $\mathcal{A}_X$, $\mathcal{A}_Y$,
and $\mathcal{A}_Z$ be the $\sigma$-fields generated by each of them. Suppose that for all $A \in \mathcal{A}_X$,
$\Pr(A | Y, Z) = \Pr(A | Y)$. Then $X$ and $Z$ are conditionally independent given $Y$.

Proof. We need to check that for every $A \in \mathcal{A}_X$ and $B \in \mathcal{A}_Z$, $\Pr(A \cap B | Y) =
\Pr(A | Y) \Pr(B | Y)$. Equivalently, for all such $A$ and $B$, and all $C \in \mathcal{A}_Y$, we must
show

$$\Pr(A \cap B \cap C) = \int I_C(s) \Pr(A | Y)(s) \Pr(B | Y)(s)\,d\mu(s). \tag{B.65}$$

Since we have assumed that $\Pr(A | Y, Z) = \Pr(A | Y)$, we have that, for all $B \in \mathcal{A}_Z$
and $C \in \mathcal{A}_Y$,

$$\Pr(A \cap B \cap C) = \int I_C(s) I_B(s) \Pr(A | Y)(s)\,d\mu(s).$$

We can use Proposition B.27 with $g(Y) = I_C \Pr(A | Y)$ and $X = I_B$ to see that

$$\int I_C(s) I_B(s) \Pr(A | Y)(s)\,d\mu(s) = \int I_C(s) \Pr(A | Y)(s) \Pr(B | Y)(s)\,d\mu(s).$$

Together, these last two equations prove (B.65). □

26 This theorem is used in the proofs of Theorems 2.14 and 2.20.
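A finite example makes the content of Theorem B.64 concrete. The sketch below assumes a hypothetical three-step Markov chain $X \to Y \to Z$ on $\{0, 1\}$ with made-up transition probabilities; exact rational arithmetic is used so that the equalities are checked without rounding error. It verifies the hypothesis $\Pr(X = x | Y, Z) = \Pr(X = x | Y)$ and then the conclusion that $X$ and $Z$ are conditionally independent given $Y$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical Markov chain X -> Y -> Z on {0, 1}.
p_x = {0: Fraction(1, 3), 1: Fraction(2, 3)}
p_y_given_x = {0: {0: Fraction(3, 4), 1: Fraction(1, 4)},
               1: {0: Fraction(1, 5), 1: Fraction(4, 5)}}
p_z_given_y = {0: {0: Fraction(1, 2), 1: Fraction(1, 2)},
               1: {0: Fraction(1, 10), 1: Fraction(9, 10)}}

# Joint pmf of (X, Y, Z).
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

INDEX = {"x": 0, "y": 1, "z": 2}


def prob(**fixed):
    """Probability that the named coordinates take the given values."""
    return sum(p for key, p in joint.items()
               if all(key[INDEX[name]] == val for name, val in fixed.items()))


def hypothesis_holds():
    """Check Pr(X = x | Y = y, Z = z) == Pr(X = x | Y = y) for all x, y, z."""
    return all(prob(x=x, y=y, z=z) / prob(y=y, z=z)
               == prob(x=x, y=y) / prob(y=y)
               for x, y, z in joint)


def conclusion_holds():
    """Check Pr(X = x, Z = z | Y) == Pr(X = x | Y) * Pr(Z = z | Y)."""
    return all(prob(x=x, y=y, z=z) / prob(y=y)
               == (prob(x=x, y=y) / prob(y=y)) * (prob(y=y, z=z) / prob(y=y))
               for x, y, z in joint)
```

Both `hypothesis_holds()` and `conclusion_holds()` return `True` for this chain, as the theorem predicts whenever the hypothesis is satisfied.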
The following result relates product measure on a product space to independent 
random variables. 

Proposition B.66. Let $(S, \mathcal{A}, \mu)$ be a probability space and let $(\mathcal{T}_i, \mathcal{B}_i)$ ($i =
1, \ldots, n$) be measurable spaces. Let $X_i : S \to \mathcal{T}_i$ be measurable for $i = 1, \ldots, n$. Let
$\mu_i$ be the measure that $X_i$ induces on $\mathcal{T}_i$ for each $i$, and let $\mathcal{T}^n = \mathcal{T}_1 \times \cdots \times \mathcal{T}_n$,
$\mathcal{B}^n = \mathcal{B}_1 \otimes \cdots \otimes \mathcal{B}_n$. Let $\mu^*$ be the measure that $(X_1, \ldots, X_n)$ induces on $(\mathcal{T}^n, \mathcal{B}^n)$
from $\mu$. Then $\mu^*$ is the product measure $\mu^n = \mu_1 \times \cdots \times \mu_n$, if and only if the $X_i$
are independent.

The same result holds for conditional independence.

Corollary B.67. Random quantities $X_1, \ldots, X_n$ are conditionally independent
given $Y$ if and only if the product measure of the conditional distributions of
$X_1, \ldots, X_n$ given $Y$ is a version of the conditional distribution of $(X_1, \ldots, X_n)$
given $Y$.

There is an interesting theorem that applies to sequences of independent ran- 
dom variables, even if they are not identically distributed. 

Theorem B.68 (Kolmogorov zero-one law). 27 Suppose that $(S, \mathcal{A}, \mu)$ is a
probability space. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random quantities.
For each $n$, let $\mathcal{C}_n$ be the $\sigma$-field generated by $(X_n, X_{n+1}, \ldots)$ and let $\mathcal{C} = \bigcap_{n=1}^{\infty} \mathcal{C}_n$.
Then every set in $\mathcal{C}$ has probability 0 or probability 1.

Proof. Let $\mathcal{A}_n$ be the $\sigma$-field generated by $(X_1, \ldots, X_n)$. Then $\mathcal{C}^* = \bigcup_{n=1}^{\infty} \mathcal{A}_n$ is
a field. It is easy to see that $\mathcal{C}$ is contained in the smallest $\sigma$-field containing $\mathcal{C}^*$.
Let $A \in \mathcal{C}$. By Lemma A.24, for every $k > 0$, there exists $n$ and $C_k \in \mathcal{A}_n$ such
that $\mu(A \Delta C_k) < 1/k$. It follows that

$$\lim_{k \to \infty} \mu(C_k) = \mu(A), \qquad \lim_{k \to \infty} \mu(C_k \cap A) = \mu(A). \tag{B.69}$$

Since $A \in \mathcal{C}$, it follows that $A \in \mathcal{C}_{n+1}$; hence $A$ and $C_k$ are independent for
every $k$. It follows that $\mu(C_k \cap A) = \mu(A)\mu(C_k)$. It follows from (B.69) that
$\mu(A) = \mu(A)^2$, and hence either $\mu(A) = 0$ or $\mu(A) = 1$. □

27 This theorem is used in the proofs of Corollary 1.63 and Lemma 7.83, and in
the discussion of "sampling to a foregone conclusion" in Section 9.4.

The $\sigma$-field $\mathcal{C}$ in Theorem B.68 is often called the tail $\sigma$-field of the sequence
$\{X_n\}_{n=1}^{\infty}$. An interesting feature of the tail $\sigma$-field is that limits are measurable
with respect to it. 28 (See Problem 21 on page 663.)

B.3.5 The Law of Total Probability 

Next, we introduce some theorems that are very simple to state for discrete 
random variables but appear to be rather unwieldy in the general case. We will, 
however, need them often. 

Theorem B.70 (Law of total probability). Let $(S, \mathcal{A}, \mu)$ be a probability
space, and let $Z$ be a random variable with $E(|Z|) < \infty$. Let $\mathcal{C} \subseteq \mathcal{B}$ be sub-$\sigma$-fields
of $\mathcal{A}$. Then $E(Z|\mathcal{C}) = E(E(Z|\mathcal{B})|\mathcal{C})$, a.s. $[\mu]$.

Proof. Define $T = E(Z|\mathcal{B}) : S \to \mathbb{R}$, which is any $\mathcal{B}$ measurable function
satisfying $E(Z I_B) = \int_B T(s)\,d\mu(s)$, for all $B \in \mathcal{B}$. We need to show that $E(Z|\mathcal{C}) =
E(T|\mathcal{C})$ a.s. $[\mu]$. The function $E(T|\mathcal{C})$ is any $\mathcal{C}$ measurable function satisfying
$\int_C E(T|\mathcal{C})(s)\,d\mu(s) = E(T I_C)$, for all $C \in \mathcal{C}$. But, since $\mathcal{C} \subseteq \mathcal{B}$, $C \in \mathcal{C}$ implies
$C \in \mathcal{B}$. So, for $C \in \mathcal{C}$,

$$\int_C E(T|\mathcal{C})(s)\,d\mu(s) = E(T I_C) = \int I_C(s) T(s)\,d\mu(s) = \int_C T(s)\,d\mu(s) = E(Z I_C),$$

where the last equality follows since $T = E(Z|\mathcal{B})$ and $C \in \mathcal{B}$. Since $E(T|\mathcal{C})$ is $\mathcal{C}$
measurable, equating the first and last entries of the above string of equations
means that $E(T|\mathcal{C})$ satisfies the condition required for it to equal $E(Z|\mathcal{C})$. □
When $\mathcal{B}$ and $\mathcal{C}$ are the $\sigma$-fields generated by two random quantities $X$ and
$Y$, respectively, $\mathcal{C} \subseteq \mathcal{B}$ means $Y$ is a function of $X$. So, Theorem B.70 can be
rewritten in this case.

Corollary B.71. Let $X : S \to \mathcal{U}_1$, $Y : S \to \mathcal{U}_2$, and $Z : S \to \mathbb{R}$ be measurable
functions such that $E(|Z|) < \infty$. Suppose that $Y$ is a function of $X$. Then,

$$E(Z|Y) = E\{E(Z|X)|Y\}, \quad \text{a.s. } [\mu].$$

The most popular special case of this corollary occurs when $Y$ is constant.

Corollary B.72. 29 Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $X : S \to \mathcal{U}_1$ and
$Z : S \to \mathbb{R}$ be measurable functions such that $E(|Z|) < \infty$. Then, $E(Z) =
E\{E(Z|X)\}$.

This is the special case of Theorem B.70 when $\mathcal{C}$ is the trivial $\sigma$-field.
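On a finite sample space the law of total probability reduces to averaging within blocks, which can be checked exactly. The sketch below is a hypothetical illustration: the sample space `S`, the uniform probability `mu`, and the variables `Z`, `X`, and `Y` are invented for this example, with `Y` constructed as a function of `X` as Corollary B.71 requires.

```python
from fractions import Fraction

# Hypothetical finite probability space: S = {0, ..., 5}, uniform probability.
S = range(6)
mu = {s: Fraction(1, 6) for s in S}

Z = {s: Fraction(s * s) for s in S}  # the random variable being averaged
X = {s: s // 2 for s in S}           # X takes the values 0, 1, 2
Y = {s: X[s] // 2 for s in S}        # Y is a function of X (values 0, 1)


def expectation(W):
    """E(W) under mu."""
    return sum(W[s] * mu[s] for s in S)


def cond_exp(W, G):
    """E(W | G) as a function on S: average W over each level set of G."""
    out = {}
    for s in S:
        block = [t for t in S if G[t] == G[s]]
        mass = sum(mu[t] for t in block)
        out[s] = sum(W[t] * mu[t] for t in block) / mass
    return out


E_Z_given_X = cond_exp(Z, X)
E_Z_given_Y = cond_exp(Z, Y)
tower = cond_exp(E_Z_given_X, Y)  # E{E(Z|X)|Y}
```

In this example `tower` agrees with `E_Z_given_Y` at every sample point, and `expectation(E_Z_given_X)` equals `expectation(Z)`, matching Corollaries B.71 and B.72.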

The following theorem implies that if a conditional mean given X depends on 
X only through h(X), then it is also the conditional mean given h(X). 

Theorem B.73. 30 Let $(S, \mathcal{A}, \mu)$ be a probability space and let $\mathcal{B}$ and $\mathcal{C}$ be
sub-$\sigma$-fields of $\mathcal{A}$ with $\mathcal{C} \subseteq \mathcal{B}$. Let $Z : S \to \mathbb{R}$ be measurable such that $E(|Z|) <
\infty$. Then there exists a version of $E(Z|\mathcal{B})$ that is $\mathcal{C}$ measurable if and only if
$E(Z|\mathcal{B}) = E(Z|\mathcal{C})$, a.s.

Proof. For the "if" direction, if $E(Z|\mathcal{B}) = E(Z|\mathcal{C})$, a.s. $[\mu]$, then $E(Z|\mathcal{C})$ is
measurable with respect to both $\mathcal{C}$ and $\mathcal{B}$, and hence it is a $\mathcal{C}$ measurable version
of $E(Z|\mathcal{B})$. For the "only if" direction, if $W$ is a $\mathcal{C}$ measurable version of $E(Z|\mathcal{B})$,
then $W = E(W|\mathcal{C})$, a.s. $[\mu]$ by the second part of Proposition B.25. By the law
of total probability B.70, $E(W|\mathcal{C}) = E(Z|\mathcal{C})$, a.s. $[\mu]$. □

28 The tail $\sigma$-field will play a role in the proofs of Corollary 1.63 and
Theorem 1.49.

29 This corollary is used in the proof of Theorem B.75.

30 This theorem is used in the proofs of Theorems 1.49 and 2.6.
A useful corollary is the following. 

Corollary B.74. 31 Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $(S_1, \mathcal{A}_1)$ and $(S_2, \mathcal{A}_2)$
be measurable spaces, and let $X : S \to S_1$ and $h : S_1 \to S_2$ be measurable
functions. Let $Z : S \to \mathbb{R}$ be measurable such that $E(|Z|) < \infty$. Define $Y = h(X)$.
Then $E(Z | X = x, Y = y) = E(Z | X = x)$ a.s. with respect to the measure on
$(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ induced by $(X, Y) : S \to S_1 \times S_2$ from $\mu$.

The following theorem deals with conditioning on two random quantities at
the same time. In words, it says that the conditional mean of a random variable
$Z$ given two random quantities $X_1$ and $X_2$ can be calculated two ways. One is
to condition on both $X_1$ and $X_2$ at once, and the other is to condition on one
of them, say $X_2$, and then find the conditional mean of $Z$ given $X_1$, but starting
from the conditional distribution of $(Z, X_1)$ given $X_2$.

Theorem B.75. 32 Let $(S, \mathcal{A}, \mu)$ be a probability space and let $(\mathcal{X}_i, \mathcal{B}_i)$ for $i = 1, 2$
be measurable spaces. Let $X_i : S \to \mathcal{X}_i$ for $i = 1, 2$ and $Z : S \to \mathbb{R}$ be random
quantities such that $E(|Z|) < \infty$. Let $\mu_{1,2,Z}$ denote the measure on $(\mathcal{X}_1 \times \mathcal{X}_2 \times
\mathbb{R}, \mathcal{B}_1 \otimes \mathcal{B}_2 \otimes \mathcal{B})$ induced by $(X_1, X_2, Z)$ from $\mu$. (Here, $\mathcal{B}$ denotes the Borel
$\sigma$-field.) For each $(x, y) \in \mathcal{X}_1 \times \mathcal{X}_2$, let $g(x, y)$ denote $E(Z | (X_1, X_2) = (x, y))$. For
each $A \in \mathcal{A}$ and $y \in \mathcal{X}_2$, let $\mu^{(2)}(A|y)$ denote $\Pr(A | X_2 = y)$. For each $y \in \mathcal{X}_2$,
let $h(x, y)$ denote the conditional mean of $Z$ given $X_1 = x$ calculated in the
probability space $(S, \mathcal{A}, \mu^{(2)}(\cdot|y))$. Then $h = g$ a.s. $[\mu_{1,2,Z}]$.

Proof. Saying that $h = g$ a.s. $[\mu_{1,2,Z}]$ is equivalent to saying that

$$h(X_1(s), X_2(s)) = g(X_1(s), X_2(s)), \quad \text{a.s. } [\mu].$$

To prove this we first note that $f(s) = h(X_1(s), X_2(s))$ is measurable with respect
to the $\sigma$-field generated by $(X_1, X_2)$, $\mathcal{A}_{X_1,X_2}$. All that remains is to show that
it satisfies the integral condition required to be $E(Z | X_1, X_2)$. That is, for all
$C \in \mathcal{A}_{X_1,X_2}$,

$$E(Z I_C) = \int_C f(s)\,d\mu(s). \tag{B.76}$$

Let $\mu_2$ be the measure on $(\mathcal{X}_2, \mathcal{B}_2)$ induced by $X_2$ from $\mu$. First, suppose that
$C = A \cap B$ with $A \in \mathcal{A}_{X_1}$ and $B \in \mathcal{A}_{X_2}$. The last hypothesis of the theorem
says that for all $A \in \mathcal{A}_{X_1}$, $E(Z I_A | X_2 = y) = \int_A h(X_1(s), y)\,d\mu^{(2)}(s|y)$. If $\mu_{1|2}(\cdot|y)$
is the probability on $(\mathcal{X}_1, \mathcal{B}_1)$ induced by $X_1$ from $\mu^{(2)}(\cdot|y)$, then $\mu_{1|2}(\cdot|y)$ is also
the conditional distribution of $X_1$ given $X_2 = y$ as in Theorem B.46. Suppose
that $A = X_1^{-1}(D)$ and $B = X_2^{-1}(F)$. Then $A \cap B = (X_1, X_2)^{-1}(D \times F)$ and
$E(Z I_A | X_2 = y) = \int_D h(x, y)\,d\mu_{1|2}(x|y)$. By Corollary B.72 and Theorem B.46, we
can write

$$E(Z I_A I_B) = \int_F \int_D h(x, y)\,d\mu_{1|2}(x|y)\,d\mu_2(y) = \int_{D \times F \times \mathbb{R}} h(x, y)\,d\mu_{1,2,Z}(x, y, z) = \int_{A \cap B} f(s)\,d\mu(s).$$

This proves (B.76) for $C = A \cap B$. Let $\mathcal{C}$ be the collection of all sets $C$ in $\mathcal{A}$ such
that (B.76) holds. Clearly $S \in \mathcal{C}$. If $C \in \mathcal{C}$, then $C^c \in \mathcal{C}$ since $\int_S f(s)\,d\mu(s) =
E(Z)$. By additivity of integrals, if $\{C_i\}_{i=1}^{\infty} \in \mathcal{C}$, then $\bigcup_{i=1}^{\infty} C_i \in \mathcal{C}$, hence $\mathcal{C}$
contains the smallest $\sigma$-field containing all sets of the form $A \cap B$ for $A \in \mathcal{A}_{X_1}$
and $B \in \mathcal{A}_{X_2}$. Theorem A.34 can be used to show that this $\sigma$-field is $\mathcal{A}_{X_1,X_2}$. □

31 This corollary is used in the proof of Theorem 2.14.

32 This theorem is used in the proof of Lemma 2.120, and it is used in making
sense of the notation $E_\theta$ when introducing parametric models.

If a random variable has finite second moment, then there is a concept of 
conditional variance. 

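Definition B.77. Let $X : S \to \mathbb{R}^k$ have finite second moment, and let $\mathcal{C}$ be a
sub-$\sigma$-field of $\mathcal{A}$. Then the conditional covariance matrix of $X$ given $\mathcal{C}$ is defined
as $\mathrm{Var}(X|\mathcal{C}) = E[(X - E(X|\mathcal{C}))(X - E(X|\mathcal{C}))^\top | \mathcal{C}]$.

The following result is easy to prove.

Proposition B.78. 33 Let $X : S \to \mathbb{R}^k$ have finite second moment, and let $\mathcal{C}$ be
a sub-$\sigma$-field of $\mathcal{A}$. Then $\mathrm{Var}(X) = E\,\mathrm{Var}(X|\mathcal{C}) + \mathrm{Var}[E(X|\mathcal{C})]$.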
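The variance decomposition in Proposition B.78 can be verified exactly in a small scalar case. The following sketch assumes a hypothetical setup: a uniform distribution on $\{0, \ldots, 5\}$, $X(s) = s$, and conditioning on the parity of $s$; every name in the code is an illustrative choice.

```python
from fractions import Fraction

# Hypothetical finite probability space: S = {0, ..., 5}, uniform probability.
S = range(6)
mu = {s: Fraction(1, 6) for s in S}
X = {s: Fraction(s) for s in S}
G = {s: s % 2 for s in S}  # the conditioning sigma-field is generated by parity


def expectation(W):
    """E(W) under mu."""
    return sum(W[s] * mu[s] for s in S)


def cond_exp(W, G):
    """E(W | G) as a function on S: average W over each level set of G."""
    out = {}
    for s in S:
        block = [t for t in S if G[t] == G[s]]
        mass = sum(mu[t] for t in block)
        out[s] = sum(W[t] * mu[t] for t in block) / mass
    return out


def variance(W):
    """Var(W) under mu."""
    m = expectation(W)
    return expectation({s: (W[s] - m) ** 2 for s in S})


cond_mean = cond_exp(X, G)                                        # E(X | C)
cond_var = cond_exp({s: (X[s] - cond_mean[s]) ** 2 for s in S}, G)  # Var(X | C)

lhs = variance(X)
rhs = expectation(cond_var) + variance(cond_mean)
```

Here `lhs` and `rhs` are both $35/12$: the within-parity variance contributes $8/3$ and the variance of the two conditional means contributes $1/4$.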



B.4 Limit Theorems 

There are several types of convergence that will be of interest to us. They involve 
sequences of random quantities or sequences of distributions. 

B.4.1 Convergence in Distribution and in Probability 

The simplest type of convergence occurs when the distributions have densities 
with respect to a common measure. The following theorem is due to Scheffe 
(1947). 

Theorem B.79 (Scheffé's theorem). 34 Let $\{p_n\}_{n=1}^{\infty}$ and $p$ be nonnegative
functions from a measure space $(\mathcal{X}, \mathcal{B}, \nu)$ to $\mathbb{R}$ such that the integral of each
function is 1 and $\lim_{n \to \infty} p_n(x) = p(x)$, a.e. $[\nu]$. Then

$$\lim_{n \to \infty} \int_B p_n(x)\,d\nu(x) = \int_B p(x)\,d\nu(x), \quad \text{for all } B \in \mathcal{B}.$$

Proof. Let $\delta_n(x) = p_n(x) - p(x)$, and let $\delta_n^+$ and $\delta_n^-$ be its positive and
negative parts. Clearly, both $\lim_{n \to \infty} \delta_n^+ = 0$ and $\lim_{n \to \infty} \delta_n^- = 0$, a.e. $[\nu]$. Since
$0 \le \delta_n^- \le p$ is true, it follows from the dominated convergence theorem A.57
that $\lim_{n \to \infty} \int_B \delta_n^-(x)\,d\nu(x) = 0$ for all $B$. Since both $p_n$ and $p$ are densities,
$\int_{\mathcal{X}} \delta_n(x)\,d\nu(x) = 0$ for all $n$. It follows that $\lim_{n \to \infty} \int_{\mathcal{X}} \delta_n^+(x)\,d\nu(x) = 0$. Since
$I_B(x)\delta_n^+(x) \le \delta_n^+(x)$ for all $x$, it follows from Proposition A.58 that

$$\lim_{n \to \infty} \int_B \delta_n^+(x)\,d\nu(x) = 0.$$

So, $\lim_{n \to \infty} \int_B |p_n(x) - p(x)|\,d\nu(x) = 0$ for all $B$. □

33 This proposition is used in the proofs of Theorems 2.36 and 2.86.
34 This theorem is used in the proofs of Lemma 1.113 and Theorem 1.121.
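The last line of the proof says that pointwise convergence of densities forces the $L^1$ distance $\int |p_n - p|\,d\nu$ to zero. The sketch below illustrates this numerically under an assumed example that is not from the text: $p_n$ is the $N(1/n, 1)$ density and $p$ is the $N(0, 1)$ density, with the integral approximated by a midpoint rule on a truncated interval (the truncation and step count are arbitrary choices).

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def l1_distance(n, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation to the integral of |p_n - p|."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(normal_pdf(x, 1.0 / n) - normal_pdf(x, 0.0))
    return total * h


# Distances for n = 1, 10, 100 shrink toward zero.
distances = [l1_distance(n) for n in (1, 10, 100)]
```

Since $|\int_B p_n\,d\nu - \int_B p\,d\nu| \le \int |p_n - p|\,d\nu$ for every $B$, the shrinking values in `distances` bound the error in the theorem's conclusion uniformly over sets $B$.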
Since defining convergence requires a topology, the following definitions require 
that the random quantities lie in various types of topological spaces. 

Definition B.80. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random quantities and let $X$
be another random quantity, all taking values in the same topological space $\mathcal{X}$.
If $\lim_{n \to \infty} E(f(X_n)) = E(f(X))$ for every bounded continuous function
$f : \mathcal{X} \to \mathbb{R}$, then we say that $X_n$ converges in distribution to $X$, which is
written $X_n \xrightarrow{D} X$.

Convergence in distribution is sometimes defined in terms of probability
measures. The reason is that if $X_n \xrightarrow{D} X$, the actual values of $X_n$ and of $X$ do not
play any role in the convergence. All that matters is the distributions of $X_n$ and
of $X$.

Definition B.81. Let $\{P_n\}_{n=1}^{\infty}$ be a sequence of probability measures on a
topological space $(\mathcal{X}, \mathcal{B})$ where $\mathcal{B}$ contains all open sets. Let $P$ be another probability
on $(\mathcal{X}, \mathcal{B})$. We say that $P_n$ converges weakly 35 to $P$ (denoted $P_n \xrightarrow{W} P$) if, for each
bounded continuous function $g : \mathcal{X} \to \mathbb{R}$, $\lim_{n \to \infty} \int g(x)\,dP_n(x) = \int g(x)\,dP(x)$.



35 This is not exactly the same as the concept of weak convergence in normed
linear spaces [see, for example, Dunford and Schwartz (1957), p. 419]. The
collection of all probability measures on a space $(\mathcal{X}, \mathcal{B})$ can be considered a subset
of a normed linear space $\mathcal{C}$ consisting of all finite signed measures $\nu$ (see
Definition A.18) with the norm being $\sup_{B \in \mathcal{B}} |\nu(B)|$. Weak convergence of a sequence
$\{\nu_n\}_{n=1}^{\infty}$ in this space would require the convergence of $L(\nu_n)$ for every bounded
linear functional $L$ on $\mathcal{C}$. Every bounded measurable function $g$ on $(\mathcal{X}, \mathcal{B})$
determines a bounded linear functional $L_g$ on $\mathcal{C}$ by $L_g(\nu) = \int g(x)\,d\nu(x)$, where the
integral with respect to a signed measure can be defined as in Problem 27 on
page 605. Hence, weak convergence of a sequence of probability measures would
require convergence of the means of all bounded measurable functions. In
particular, $\lim_{n \to \infty} P_n(B) = P(B)$ for all measurable sets $B$, not just those for which
$P$ assigns 0 probability to the boundary (see the portmanteau theorem B.83 on
page 636). Alternatively, we can consider the set of bounded continuous functions
$f : \mathcal{X} \to \mathbb{R}$ as a normed linear space $\mathcal{N}$ with $\|f\| = \sup_x |f(x)|$. Then the set of
finite signed measures $\mathcal{C}$ is a set of bounded linear functionals on $\mathcal{N}$ using the
definition $\nu(f) = \int f(x)\,d\nu(x)$. Weak* convergence of a sequence $\{\nu_n\}_{n=1}^{\infty}$ in $\mathcal{C}$ to
$\nu$ is defined as the convergence of $\nu_n(f)$ to $\nu(f)$ for all $f \in \mathcal{N}$. This is precisely
convergence in distribution. Hence, it would make more sense to call convergence
in distribution weak* convergence rather than weak convergence. Since the
tradition in probability theory is to call it weak convergence, we will continue to do
so.



It is easy to see that these two types of convergence are the same. 

Proposition B.82. Let $P_n$ be the distribution of $X_n$, and let $P$ be the distribution
of $X$. Then, $X_n \xrightarrow{D} X$ if and only if $P_n \xrightarrow{W} P$.

Since we will usually be dealing with X spaces that are metric spaces, there are 
some equivalent ways to define convergence in distribution or weak convergence. 
The proofs of Theorems B.83 and B.88 are adapted from Billingsley (1968). 

Theorem B.83 (Portmanteau theorem). 36 The following are all equivalent
in a metric space:

1. $P_n \xrightarrow{W} P$;

2. $\limsup_{n \to \infty} P_n(B) \le P(B)$, for each closed $B$;

3. $\liminf_{n \to \infty} P_n(A) \ge P(A)$, for each open $A$;

4. $\lim_{n \to \infty} P_n(C) = P(C)$, for each $C$ with $P(\partial C) = 0$. 37

Proof. Let $d$ be the metric in the metric space. First, assume (1) and let $B$ be
a closed set. Let $\delta > 0$ be given. For each $\epsilon > 0$, define $C_\epsilon = \{x : d(x, B) \le \epsilon\}$,
where $d(x, B) = \inf_{y \in B} d(x, y)$. Since $|d(x, B) - d(y, B)| \le d(x, y)$, we see that
$d(x, B)$ is continuous in $x$. Each $C_\epsilon$ is closed and $\bigcap_{\epsilon > 0} C_\epsilon = B$. Let $\epsilon$ be small
enough so that $P(C_\epsilon) < P(B) + \delta$. Let $f : \mathbb{R} \to \mathbb{R}$ be

$$f(t) = \begin{cases} 1 & \text{if } t \le 0, \\ 1 - t & \text{if } 0 < t < 1, \\ 0 & \text{if } t \ge 1, \end{cases}$$

and define $g_\epsilon(x) = f(d(x, B)/\epsilon)$. Then $g_\epsilon$ is bounded and continuous. So,

$$\lim_{n \to \infty} \int g_\epsilon(x)\,dP_n(x) = \int g_\epsilon(x)\,dP(x).$$

It is easy to see that $0 \le g_\epsilon(x) \le 1$, $g_\epsilon(x) = 1$ for all $x \in B$, and $g_\epsilon(x) = 0$ for all
$x \notin C_\epsilon$. Hence, for every $\delta > 0$,

$$P_n(B) = \int I_B(x)\,dP_n(x) \le \int g_\epsilon(x)\,dP_n(x) \to \int g_\epsilon(x)\,dP(x) \le \int I_{C_\epsilon}(x)\,dP(x) = P(C_\epsilon) < P(B) + \delta.$$

It follows that $\limsup_{n \to \infty} P_n(B) \le P(B)$, which is (2).

That (2) and (3) are equivalent follows easily from the facts that if $A$ is open,
then $B = A^c$ is closed and $P_n(A) = 1 - P_n(B)$. It is also easy to see that (2) and
(3) together imply (4). Next assume (4), let $B$ be a closed set, and define $C_\epsilon$ as
above. The boundary of $C_\epsilon$ is a subset of $\{x : d(x, B) = \epsilon\}$. There can be at most
countably many $\epsilon$ such that these sets have positive probability. Hence, there
exists a sequence $\{\epsilon_k\}_{k=1}^{\infty}$ converging to 0 such that $P(\{x : d(x, B) = \epsilon_k\}) = 0$ for all
$k$. It follows that $\lim_{n \to \infty} P_n(C_{\epsilon_k}) = P(C_{\epsilon_k})$ for all $k$. Since $P_n(B) \le P_n(C_{\epsilon_k})$
for every $n$ and $k$, we have, for every $k$,

$$\limsup_{n \to \infty} P_n(B) \le \lim_{n \to \infty} P_n(C_{\epsilon_k}) = P(C_{\epsilon_k}).$$

Since $P(B) = \lim_{k \to \infty} P(C_{\epsilon_k})$, we have (2). So, (2), (3), and (4) are equivalent
and (1) implies (2).

36 This theorem is used in the proofs of Theorem B.88 and Lemma 7.19.

37 We use the symbol $\partial$ in front of the name of a subset of a topological space
to refer to the boundary of the set. The boundary of a set $C$ in a topological space
is the intersection of the closure of the set with the closure of the complement.

All that remains is to prove that (2) implies (1). Assume (2), and let $f$ be
a bounded continuous function. Let $m < f(x) < M$ for all $x$. For each $k$, let
$F_{i,k} = \{x : f(x) \le m + (M - m)i/k\}$ for $i = 1, \ldots, k$. Let $F_{0,k} = \emptyset$. Each $F_{i,k}$ is
closed, since $f$ is continuous. Let $G_{i,k} = F_{i,k} \setminus F_{i-1,k}$ for $i = 1, \ldots, k$. It is easy
to see that for every probability $Q$,

$$m + (M - m) \sum_{i=1}^{k} \frac{i - 1}{k} Q(G_{i,k}) \le \int f(x)\,dQ(x) \le m + (M - m) \sum_{i=1}^{k} \frac{i}{k} Q(G_{i,k}).$$

Since $Q(G_{i,k}) = Q(F_{i,k}) - Q(F_{i-1,k})$ for every $i$ and $k$, we get

$$M - \frac{M - m}{k} \sum_{i=1}^{k} Q(F_{i,k}) \le \int f(x)\,dQ(x) \le M + \frac{M - m}{k} - \frac{M - m}{k} \sum_{i=1}^{k} Q(F_{i,k}). \tag{B.84}$$

For each $i$,

$$\limsup_{n \to \infty} P_n(F_{i,k}) \le P(F_{i,k}). \tag{B.85}$$

It follows that, for every $k$,

$$\begin{aligned}
\int f(x)\,dP(x) &\le M + \frac{M - m}{k} - \frac{M - m}{k} \sum_{i=1}^{k} P(F_{i,k}) \\
&\le M + \frac{M - m}{k} - \frac{M - m}{k} \sum_{i=1}^{k} \limsup_{n \to \infty} P_n(F_{i,k}) \\
&\le \frac{M - m}{k} + \liminf_{n \to \infty} \int f(x)\,dP_n(x),
\end{aligned}$$

where the first inequality follows from the second inequality in (B.84) with $Q = P$,
the second inequality follows from (B.85), and the third inequality follows from
the first inequality in (B.84) with $Q = P_n$. Letting $k$ be arbitrarily large, we get

$$\int f(x)\,dP(x) \le \liminf_{n \to \infty} \int f(x)\,dP_n(x). \tag{B.86}$$

Now, apply the same reasoning to $-f$ to get

$$-\int f(x)\,dP(x) \le \liminf_{n \to \infty} \int -f(x)\,dP_n(x) = -\limsup_{n \to \infty} \int f(x)\,dP_n(x),$$

$$\int f(x)\,dP(x) \ge \limsup_{n \to \infty} \int f(x)\,dP_n(x). \tag{B.87}$$

Together, (B.86) and (B.87) imply (1). □
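A standard example shows why the inequalities in parts 2 and 3 of the portmanteau theorem can be strict and why part 4 needs the boundary condition. The sketch below assumes $P_n$ is the point mass at $1/n$ and $P$ is the point mass at 0 (an example chosen for this illustration, not taken from the text); the limits are approximated by scanning a finite range of $n$.

```python
def P_n(n, indicator):
    """P_n is the point mass at 1/n, evaluated on the set given by indicator."""
    return 1.0 if indicator(1.0 / n) else 0.0


def P(indicator):
    """P is the point mass at 0, evaluated on the set given by indicator."""
    return 1.0 if indicator(0.0) else 0.0


def closed_B(x):
    """Indicator of the closed set B = {0}."""
    return x == 0.0


def open_A(x):
    """Indicator of the open set A = (0, 2); its boundary {0, 2} has P-mass 1."""
    return 0.0 < x < 2.0


def continuity_C(x):
    """Indicator of C = (-1, 1); its boundary {-1, 1} has P-mass 0."""
    return -1.0 < x < 1.0


ns = range(2, 200)
sup_closed = max(P_n(n, closed_B) for n in ns)  # stands in for the limsup
inf_open = min(P_n(n, open_A) for n in ns)      # stands in for the liminf
values_C = [P_n(n, continuity_C) for n in ns]
```

Here `sup_closed` is 0 while `P(closed_B)` is 1, and `inf_open` is 1 while `P(open_A)` is 0, so both inequalities are strict. For the set $C$ whose boundary has $P$-probability 0, every entry of `values_C` equals `P(continuity_C)`.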




Theorem B.88 (Continuous mapping theorem). 38 Let $\{X_n\}_{n=1}^{\infty}$ be a
sequence of random quantities, and let $X$ be another random quantity all taking
values in the same metric space $\mathcal{X}$. Suppose that $X_n \xrightarrow{D} X$. Let $\mathcal{Y}$ be a metric
space and let $g : \mathcal{X} \to \mathcal{Y}$. Define

$$C_g = \{x : g \text{ is continuous at } x\}.$$

Suppose that $\Pr(X \in C_g) = 1$. Then $g(X_n) \xrightarrow{D} g(X)$.

Proof. Let $P_n$ be the distribution of $g(X_n)$ and let $P$ be the distribution of
$g(X)$. Let $B$ be a closed subset of $\mathcal{Y}$. If $x \in \overline{g^{-1}(B)}$ but $x \notin g^{-1}(B)$, then $g$ is
not continuous at $x$. It follows that $\overline{g^{-1}(B)} \subseteq g^{-1}(B) \cup C_g^c$. Now write

$$\begin{aligned}
\limsup_{n \to \infty} P_n(B) &= \limsup_{n \to \infty} \Pr(X_n \in g^{-1}(B)) \le \limsup_{n \to \infty} \Pr(X_n \in \overline{g^{-1}(B)}) \\
&\le \Pr(X \in \overline{g^{-1}(B)}) \le \Pr(X \in g^{-1}(B)) + \Pr(X \in C_g^c) \\
&= \Pr(X \in g^{-1}(B)) = P(B),
\end{aligned}$$

and the result now follows from the portmanteau theorem B.83. □
Another type of convergence is convergence in probability. 

Definition B.89. If $\{X_n\}_{n=1}^{\infty}$ and $X$ are random quantities in a metric space
with metric $d$, and if, for every $\epsilon > 0$, $\lim_{n \to \infty} \Pr(d(X_n, X) > \epsilon) = 0$, then we
say that $X_n$ converges in probability to $X$, which is written $X_n \xrightarrow{P} X$.

The following theorem is useful in that it relates convergence in distribution, 
convergence in probability, and the simpler concept of convergence almost surely. 

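Theorem B.90. 39 Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random vectors and let $X$ be a
random vector.

1. If $\lim_{n \to \infty} X_n = X$ a.s., then $X_n \xrightarrow{P} X$.

2. If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.

3. If $X$ is degenerate and $X_n \xrightarrow{D} X$, then $X_n \xrightarrow{P} X$.

4. If $X_n \xrightarrow{P} X$, then there is a subsequence $\{n_k\}_{k=1}^{\infty}$ such that $\lim_{k \to \infty} X_{n_k} =
X$, a.s.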

Proof. First, assume that $X_n$ converges a.s. to $X$. For each $n$ and $\epsilon$, let $A_{n,\epsilon} =
\{s : d(X_n(s), X(s)) \le \epsilon\}$. Then $X_n(s)$ converges to $X(s)$ if and only if

$$s \in \bigcap_{\epsilon > 0} \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} A_{n,\epsilon}.$$

Since this set must have probability 1, then so too must $\bigcup_{N=1}^{\infty} (\bigcap_{n=N}^{\infty} A_{n,\epsilon})$ for
all $\epsilon$. By Theorem A.19, it follows that for every $\epsilon$, $\lim_{N \to \infty} \Pr(\bigcap_{n=N}^{\infty} A_{n,\epsilon}) = 1$.
Hence, for each $\epsilon > 0$, $\lim_{n \to \infty} \Pr(A_{n,\epsilon}^c) = 0$, which is precisely what it means to
say that $X_n \xrightarrow{P} X$.

Next, assume that $X_n \xrightarrow{P} X$. Let $g : \mathcal{X} \to \mathbb{R}$ be bounded and continuous with
$|g(x)| \le K$ for all $x$. Let $\epsilon > 0$, and let $A$ be a compact set with $\Pr(X \in A) > 1 -
\epsilon/[6K]$. A continuous function (like $g$) on a compact set is uniformly continuous.
So let $\delta > 0$ be such that $x \in A$ and $d(x, y) < \delta$ implies $|g(x) - g(y)| < \epsilon/3$. Since
$X_n \xrightarrow{P} X$, there exists $N$ such that $n \ge N$ implies $\Pr(d(X_n, X) < \delta) > 1 - \epsilon/[6K]$.
Let $B = \{X \in A, d(X_n, X) < \delta\}$. It follows that $|g(X)I_B - g(X_n)I_B| < \epsilon/3$ and,
for all $n \ge N$, $\Pr(B) > 1 - \epsilon/[3K]$. Also, note that $n \ge N$ implies

$$|E g(X) - E[g(X)I_B]| < \frac{\epsilon}{3}, \qquad |E g(X_n) - E[g(X_n)I_B]| < \frac{\epsilon}{3}.$$

So, $n \ge N$ implies

$$\begin{aligned}
|E g(X) - E g(X_n)| &\le |E g(X) - E[g(X)I_B]| + |E[g(X)I_B] - E[g(X_n)I_B]| \\
&\quad + |E[g(X_n)I_B] - E g(X_n)| \\
&\le \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon.
\end{aligned}$$

Thus, $\lim_{n \to \infty} E g(X_n) = E g(X)$, and we have proven $X_n \xrightarrow{D} X$.

Next, suppose that $X$ is degenerate at $x_0$ and $X_n \xrightarrow{D} X$. Let $\epsilon > 0$, and define

$$g(x) = \begin{cases} 1 & \text{if } d(x, x_0) < \frac{\epsilon}{2}, \\ 0 & \text{if } d(x, x_0) > \epsilon, \\ 2 - \frac{2 d(x, x_0)}{\epsilon} & \text{otherwise.} \end{cases}$$

Since $g$ is bounded and continuous, $E g(X_n)$ converges to $E g(X)$. But $E g(X) = 1$
since $\Pr(g(X) = 1) = 1$, and $E g(X_n) \le \Pr(d(X_n, X) \le \epsilon)$, since $0 \le g(x) \le 1$ for
all $x$. So $\lim_{n \to \infty} \Pr(d(X_n, x_0) \le \epsilon) = 1$, and $X_n \xrightarrow{P} X$.

Finally, assume that $X_n \xrightarrow{P} X$. Let $n_k$ be such that $n \ge n_k$ implies

$$\Pr\left(d(X_n, X) \ge \frac{1}{k}\right) < 2^{-k}.$$

Define $A_k = \{d(X_{n_k}, X) \ge 1/k\}$. By the first Borel-Cantelli lemma A.20, we
have $\Pr(B) = 0$, where $B = \bigcap_{i=1}^{\infty} \bigcup_{k=i}^{\infty} A_k$. It is easy to check that $B$ is the
event that $d(X_{n_k}, X)$ is at least $1/k$ for infinitely many different $k$. Hence $B^c \subseteq
\{\lim_{k \to \infty} X_{n_k} = X\}$, and $\lim_{k \to \infty} X_{n_k} = X$, a.s. □

38 This theorem is used to provide a short proof of DeFinetti's representation
theorem for Bernoulli random variables in Example 1.82 on page 46.
39 This theorem is used in the proofs of Theorems B.95, 1.49, 7.26, and 7.78.
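Part 4 of Theorem B.90 cannot be strengthened to almost sure convergence of the full sequence. A classical counterexample, sketched below under assumptions that are not part of the text, takes $S = [0, 1)$ with Lebesgue measure and lets $X_n$ be the indicator of the $j$th dyadic interval of length $2^{-k}$, where $n = 2^k + j$ with $0 \le j < 2^k$. The function names are hypothetical.

```python
def block(n):
    """Write n = 2**k + j with 0 <= j < 2**k and return (k, j)."""
    k = n.bit_length() - 1
    return k, n - (1 << k)


def X(n, s):
    """Indicator that s lies in the dyadic interval [j / 2**k, (j + 1) / 2**k)."""
    k, j = block(n)
    return 1 if j / 2 ** k <= s < (j + 1) / 2 ** k else 0


def prob_nonzero(n):
    """Pr(X_n != 0) is the length of the interval, 2**(-k)."""
    k, _ = block(n)
    return 2.0 ** (-k)


s = 0.3
# X_n(s) = 1 exactly once in each block 2**k <= n < 2**(k + 1), so X_n(s)
# keeps returning to 1 and does not converge to 0.
hits = [n for n in range(1, 1024) if X(n, s) == 1]
# Along the subsequence n_k = 2**k the intervals shrink to {0}, so X_{n_k}(s) -> 0.
subsequence = [X(2 ** k, s) for k in range(2, 10)]
```

Because `prob_nonzero(n)` tends to 0, the sequence converges to 0 in probability, while `hits` shows one index per block where $X_n(s) = 1$; the subsequence $n_k = 2^k$ converges to 0 for every $s > 0$.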

B.4.2 Characteristic Functions 

There is a very important method for proving convergence in distribution which 
involves the use of characteristic functions. 

Definition B.91. Let $X$ be a random vector. The complex-valued function

$$\phi_X(t) = \mathrm{E}\left(\exp[it^{\top}X]\right)$$

is called the characteristic function of $X$. If $F$ is a $k$-dimensional distribution
function, the function $\phi_F(t) = \int \exp[it^{\top}x]\,dF(x)$ is called the characteristic
function of $F$.






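Example B.92. Let $X$ have standard normal distribution. Then

$$\phi_X(t) = \int \exp(itx)\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx = \frac{1}{\sqrt{2\pi}}\int \exp\left(-\frac{(x - it)^2 + t^2}{2}\right)dx = \exp\left(-\frac{t^2}{2}\right).$$

Similarly, for other normal distributions, $N(\mu, \sigma^2)$, the characteristic functions
are $\phi_X(t) = \exp(-\sigma^2 t^2/2 + it\mu)$.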

By Theorem B.12, if $X$ has CDF $F$, then $\phi_X = \phi_F$. It is easy to see that the
characteristic function exists for every random vector and it has complex absolute
value at most 1 for all $t$. Other facts that follow directly from the definition are
the following. If $Y = aX + b$, then $\phi_Y(t) = \phi_X(at)\exp(itb)$. If $X$ and $Y$ are
independent, $\phi_{X+Y} = \phi_X\phi_Y$.
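As a numerical sanity check on Example B.92, the following sketch approximates $\mathrm{E}\exp(itX)$ for a standard normal $X$ by the trapezoidal rule and compares it with $\exp(-t^2/2)$. The function names, the integration range $[-12, 12]$, and the step count are illustrative assumptions, not part of the text.

```python
import cmath
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def char_fn_numeric(t, pdf, lo=-12.0, hi=12.0, steps=24000):
    """Approximate E[exp(i t X)] by the trapezoidal rule on [lo, hi].

    The range and step count are assumptions chosen so that the
    truncated normal tails are negligible.
    """
    h = (hi - lo) / steps
    total = 0.5 * (cmath.exp(1j * t * lo) * pdf(lo)
                   + cmath.exp(1j * t * hi) * pdf(hi))
    for j in range(1, steps):
        x = lo + j * h
        total += cmath.exp(1j * t * x) * pdf(x)
    return total * h


def check_normal_cf(ts=(0.0, 0.5, 1.0, 2.0)):
    """Largest gap between the numeric value and exp(-t^2/2) over ts."""
    return max(abs(char_fn_numeric(t, normal_pdf) - math.exp(-0.5 * t * t))
               for t in ts)
```

Calling `check_normal_cf()` returns the largest discrepancy over a few values of $t$; it should be at the level of rounding error.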

The reason that characteristic functions are so useful for proving convergence
in distribution is two-fold. First, for each characteristic function $\phi$, there is
only one CDF $F$ such that $\phi_F = \phi$. (See the uniqueness theorem B.106.) Second,
characteristic functions are "continuous" as a function of the distribution
in the sense of convergence in distribution. That is, $X_n \xrightarrow{\mathcal{D}} X$ if and only if
$\lim_{n\to\infty} \phi_{X_n}(t) = \phi_X(t)$ for all $t$. 40 (See the continuity theorem B.93.)

Theorem B.93 (Continuity theorem). 41 For finite-dimensional random vectors,
convergence in distribution is equivalent to convergence of characteristic
functions. That is, $X_n \xrightarrow{\mathcal{D}} X$ if and only if $\lim_{n\to\infty} \phi_{X_n}(t) = \phi_X(t)$ for all $t$.

Proof. The "only if" part follows from Definition B.80 and the fact that one
can write $\exp(it^{\top}x)$ in terms of two bounded, continuous, real-valued functions of $x$
(its real and imaginary parts) for every $t$.

For the "if" part, suppose that $X$ is $k$-dimensional and that $\lim_{n\to\infty} \phi_{X_n}(t) =
\phi_X(t)$ for all $t$. To prove that for each bounded continuous $g$, $\lim_{n\to\infty} \mathrm{E}g(X_n) =
\mathrm{E}g(X)$, we will truncate $g$ to a bounded rectangle and then approximate the
truncated function by a function $g'$ whose mean is a linear combination of values
of the characteristic function. The mean of $g'(X_n)$ will then converge to the mean
of $g'(X)$. We then need to show that the means of $g'(X)$ and $g'(X_n)$ approximate
the means of $g(X)$ and $g(X_n)$, respectively.

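First, we need to find a bounded rectangle on which to do the truncation. For
each coordinate $X^{\ell}$ of $X$, we will show that if $a$ and $b$ are continuity points of
the CDF $F_{X^{\ell}}$ of $X^{\ell}$, and $F_{X^{\ell}}(b) - F_{X^{\ell}}(a) > q$, then there is $b' > b$ and $a' < a$
such that $\lim_{n\to\infty} F_{X_n^{\ell}}(b') - F_{X_n^{\ell}}(a') \ge q$. For each $a$, $b$, $\delta$, define

$$f_{a,b,\delta}(x) = \begin{cases} 1 & \text{if } a \le x \le b, \\ 1 - \dfrac{a - x}{\delta} & \text{if } a - \delta < x < a, \\ 1 - \dfrac{x - b}{\delta} & \text{if } b < x < b + \delta, \\ 0 & \text{otherwise.} \end{cases} \tag{B.94}$$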



40 This presentation is a hybrid of the presentations given by Breiman (1968, 
Chapter 8) and Hoel, Port, and Stone (1971, Chapter 8). 

41 This theorem is used in the proofs of Theorems B.95, B.97, and 7.20. 






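Note that this function has equal values at $a - \delta$ and $b + \delta$. Consider the interval
$[a - \delta, b + \delta]$ as a circle identifying the two endpoints. Now, use the
Stone-Weierstrass theorem C.3 to approximate uniformly $f_{a,b,\delta}$ to within $\epsilon$ on the circle
by $f'_{a,b,\delta,\epsilon}(x) = \sum_{j=-m}^{m} \alpha_j \exp(2\pi i j x/c)$, where $c = b - a + 2\delta$. If $Y$ is a random
variable, then $\mathrm{E}f'_{a,b,\delta,\epsilon}(Y)$ is a linear combination of values of the characteristic
function of $Y$. So, we have $\lim_{n\to\infty} \mathrm{E}f'_{a,b,\delta,\epsilon}(X_n^{\ell}) = \mathrm{E}f'_{a,b,\delta,\epsilon}(X^{\ell})$. Let $q > 0$, and
let $a$ and $b$ be continuity points of $F_{X^{\ell}}$ such that $F_{X^{\ell}}(b) - F_{X^{\ell}}(a) = v > q$. Let
$w = v - q$. Let $\delta > 0$ be arbitrary, and define $a' = a - \delta$ and $b' = b + \delta$. Let $N$ be
large enough so that $n \ge N$ implies $|\mathrm{E}f'_{a,b,\delta,w/3}(X_n^{\ell}) - \mathrm{E}f'_{a,b,\delta,w/3}(X^{\ell})| < w/3$. If
$n \ge N$, then

$$F_{X_n^{\ell}}(b') - F_{X_n^{\ell}}(a') \ge \mathrm{E}f_{a,b,\delta}(X_n^{\ell}) \ge \mathrm{E}f'_{a,b,\delta,w/3}(X_n^{\ell}) - \frac{w}{3} \ge \mathrm{E}f'_{a,b,\delta,w/3}(X^{\ell}) - \frac{2w}{3}$$

$$\ge \mathrm{E}f_{a,b,\delta}(X^{\ell}) - w \ge F_{X^{\ell}}(b) - F_{X^{\ell}}(a) - w = q.$$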

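Now, let $g$ be a bounded continuous function, and suppose that $|g(x)| \le K$ for
all $x$. Let $\epsilon > 0$. For each coordinate $X^{\ell}$ of $X$, let $a_{\ell}$ and $b_{\ell}$ be continuity points
of $F_{X^{\ell}}$ such that $F_{X^{\ell}}(b_{\ell}) - F_{X^{\ell}}(a_{\ell}) > 1 - \epsilon/(7[K + \epsilon/7]k)$. Let $\delta > 0$ be arbitrary,
and define $a'_{\ell} = a_{\ell} - \delta$, $b'_{\ell} = b_{\ell} + \delta$, and $g^*(x) = g(x)\prod_{\ell=1}^{k} f_{a'_{\ell},b'_{\ell},\delta}(x_{\ell})$. Use the
Stone-Weierstrass theorem C.3 to uniformly approximate $g^*$ to within $\epsilon/7$ on the
rectangle $\{x : a'_{\ell} - \delta \le x_{\ell} \le b'_{\ell} + \delta\}$ by

$$g'(x) = \sum_{j_1=-m_1}^{m_1} \cdots \sum_{j_k=-m_k}^{m_k} \alpha_{j_1,\ldots,j_k} \exp(2\pi i\,\tilde{j}^{\top}x),$$

where $\tilde{j}$ is the vector with $\ell$th coordinate $j_{\ell}/[b'_{\ell} - a'_{\ell} + 2\delta]$. Then,

$$\lim_{n\to\infty} \mathrm{E}g'(X_n) = \mathrm{E}g'(X).$$

Let $N_1$ be large enough so that $n \ge N_1$ implies $F_{X_n^{\ell}}(b'_{\ell}) - F_{X_n^{\ell}}(a'_{\ell}) > 1 - \epsilon/(7[K +
\epsilon/7]k)$ for all $\ell$. Let $N_2$ be large enough so that $n \ge N_2$ implies $|\mathrm{E}g'(X_n) -
\mathrm{E}g'(X)| < \epsilon/7$. Let $R$ be the rectangle $R = \{x : a'_{\ell} < x_{\ell} \le b'_{\ell}\}$. Since $g'$ is periodic
in every coordinate, it is bounded by $K + \epsilon/7$ on all of $\mathbb{R}^k$. If $n \ge \max\{N_1, N_2\}$,
then $|\mathrm{E}g(X_n) - \mathrm{E}g(X)|$ is no greater than

$$\mathrm{E}|g(X_n)I_{R^c}(X_n)| + \mathrm{E}|g(X)I_{R^c}(X)| + \mathrm{E}|g'(X_n)I_{R^c}(X_n)|$$

$$+ |\mathrm{E}g(X_n)I_R(X_n) - \mathrm{E}g'(X_n)I_R(X_n)| + |\mathrm{E}g'(X_n) - \mathrm{E}g'(X)|$$

$$+ \mathrm{E}|g'(X)I_{R^c}(X)| + |\mathrm{E}g'(X)I_R(X) - \mathrm{E}g(X)I_R(X)| < \epsilon. \qquad \Box$$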

We will prove two more limit theorems that make use of the continuity
theorem B.93. Suppose that $X$ has finite mean. Since $|\exp(itx) - 1| \le \min\{|tx|, 2\}$
for all $t$, $x$, 42 and

$$\lim_{t\to 0} \frac{\exp(itx) - 1}{t} = ix$$

for all $x$, it follows from the dominated convergence theorem that

$$\left.\frac{d}{dt}\phi_X(t)\right|_{t=0} = i\mathrm{E}(X).$$

Similarly, if $X$ has finite variance, it can be shown that

$$\left.\frac{d^2}{dt^2}\phi_X(t)\right|_{t=0} = -\mathrm{E}(X^2).$$

42 See Problem 26 on page 664.



Using these two facts, we can prove the weak law of large numbers and the central 
limit theorem. 

Theorem B.95 (Weak law of large numbers). Suppose that $\{X_n\}_{n=1}^{\infty}$ are
IID random variables with finite mean $\mu$. Then, $\bar{X}_n = \sum_{i=1}^{n} X_i/n$ converges in
probability to $\mu$.

PROOF. First, we will prove that the characteristic function of $\bar{X}_n - \mu$ converges to
1 for all $t$. Let $Y_i = X_i - \mu$. Since $\phi_{Y_i}(0) = 1$, $\log\phi_{Y_i}(t)$ exists and is differentiable
near $t = 0$, and we know that

$$\left.\frac{d}{dt}\log\phi_{Y_i}(t)\right|_{t=0} = 0 = \lim_{t\to 0}\frac{\log\phi_{Y_i}(t)}{t}. \tag{B.96}$$

The characteristic function of $\bar{X}_n - \mu$ is $\phi^*(t) = \phi_{Y_i}(t/n)^n$. For fixed $t$, let $n$ be
large enough so that $t/n$ is close enough to 0 for $\log\phi_{Y_i}(t/n)$ to be well defined.
We know that

$$\log\phi^*(t) = n\log\phi_{Y_i}\left(\frac{t}{n}\right) = t\,\frac{\log\phi_{Y_i}(t/n)}{t/n}.$$

The limit of this quantity, as $n \to \infty$, is 0 by (B.96). It follows that for all $t$,
$\lim_{n\to\infty}\phi^*(t) = 1$. By the continuity theorem B.93, $\bar{X}_n - \mu \xrightarrow{\mathcal{D}} 0$. By
Theorem B.90, $\bar{X}_n - \mu \xrightarrow{P} 0$. □
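The key step of this proof, $\phi_{Y_i}(t/n)^n \to 1$, can be illustrated with a concrete mean-zero distribution. The sketch below assumes $Y_i$ is uniform on $(-1, 1)$, whose characteristic function is $\sin(t)/t$; that choice and the function names are assumptions made only for illustration.

```python
import math


def cf_uniform(t):
    """Characteristic function of a Uniform(-1, 1) variable (mean 0)."""
    return 1.0 if t == 0 else math.sin(t) / t


def cf_centered_mean(t, n):
    """Characteristic function of the average of n IID Uniform(-1, 1)
    variables, computed as phi(t/n)**n as in the proof above."""
    return cf_uniform(t / n) ** n


def gaps_from_one(t, ns=(1, 10, 100, 1000, 10000)):
    """Distance |phi(t/n)^n - 1| for increasing n at a fixed t."""
    return [abs(cf_centered_mean(t, n) - 1.0) for n in ns]
```

For a fixed $t$, `gaps_from_one(t)` should shrink toward 0 as $n$ grows, which is the convergence the continuity theorem converts into convergence in distribution to 0.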

In Chapter 1, we prove a strong law of large numbers 1.62, which has a stronger 
conclusion and a weaker hypothesis. There is also a weak law of large numbers 
for the case of infinite means. (See Problem 27 on page 664.) 

The following theorem is very useful for approximating distributions. 

Theorem B.97 (Central limit theorem). Suppose that $\{X_i\}_{i=1}^{\infty}$ is a sequence
that is IID with finite mean $\mu$ and finite variance $\sigma^2$. Let $\bar{X}_n$ be the average of
the first $n$ $X_i$s. Then $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{\mathcal{D}} N(0, \sigma^2)$, the normal distribution with mean
0 and variance $\sigma^2$.
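A small deterministic illustration of the statement of Theorem B.97, under the assumption that the $X_i$ are Bernoulli($p$): then $n\bar{X}_n$ is Binomial($n, p$), so the exact CDF of $\sqrt{n}(\bar{X}_n - p)$ can be compared with the $N(0, p(1-p))$ CDF at every atom. The function name and the choice of distribution are illustrative only.

```python
import math


def binom_cdf_gap(n, p=0.5):
    """Largest gap, over the atoms k = 0..n, between the exact CDF of a
    Binomial(n, p) count and the normal CDF with the same mean and variance.

    Evaluating both at k is the same as evaluating the CDF of
    sqrt(n) * (mean of n Bernoulli(p) draws - p) and the N(0, p(1-p)) CDF
    at the point (k - n*p) / sqrt(n).
    """
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        # log of the binomial probability, computed via lgamma for stability
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1.0 - p))
        cdf += math.exp(log_pmf)
        normal_cdf = 0.5 * (1.0 + math.erf((k - mean) / (sd * math.sqrt(2.0))))
        worst = max(worst, abs(cdf - normal_cdf))
    return worst
```

Comparing `binom_cdf_gap(10)`, `binom_cdf_gap(100)`, and `binom_cdf_gap(1000)` shows the gap shrinking as $n$ grows, as the theorem predicts.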

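Proof. Set $Y_n = \sqrt{n}(\bar{X}_n - \mu)$. We might as well assume that $\mu = 0$, since we have
just subtracted it from each $X_i$. Since the second derivative of the characteristic
function at $t = 0$ of each $X_i$ is $-\sigma^2$, we can apply l'Hopital's rule twice to conclude

$$\lim_{t\to 0}\frac{\log\phi_{X_i}(t)}{t^2} = -\frac{\sigma^2}{2}. \tag{B.98}$$

The characteristic function of $Y_n$ is $\phi_{Y_n}(t) = \phi_{X_i}(t/\sqrt{n})^n$. We will prove that
this converges to $\exp(-t^2\sigma^2/2)$ for each $t$. Since $\log\phi_{Y_n}(t) = n\log\phi_{X_i}(t/\sqrt{n})$,
we use (B.98) to note that

$$\lim_{n\to\infty} n\log\phi_{X_i}\left(\frac{t}{\sqrt{n}}\right) = \lim_{n\to\infty} t^2\,\frac{\log\phi_{X_i}(t/\sqrt{n})}{t^2/n} = -\frac{t^2\sigma^2}{2}.$$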

It follows that $\lim_{n\to\infty}\phi_{Y_n}(t) = \exp(-t^2\sigma^2/2)$, and the continuity theorem B.93
finishes the proof. □

There is also a multivariate version of the central limit theorem.

Theorem B.99 (Multivariate central limit theorem). 43 Let $\{X_n\}_{n=1}^{\infty}$ be a
sequence of IID random vectors in $\mathbb{R}^p$ with mean $\mu$ and covariance matrix $\Sigma$.
Then $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{\mathcal{D}} N_p(0, \Sigma)$, a multivariate normal distribution.

Proof. Let $Y_n = \sqrt{n}(\bar{X}_n - \mu)$ and let $Y \sim N_p(0, \Sigma)$. Then $Y_n \xrightarrow{\mathcal{D}} Y$ if and
only if the characteristic function of $Y_n$ converges to that of $Y$. That is, if and
only if, for each $\lambda \in \mathbb{R}^p$, $\mathrm{E}\exp\{i\lambda^{\top}Y_n\} \to \mathrm{E}\exp\{i\lambda^{\top}Y\}$. This occurs if and
only if, for each $\lambda$, $\lambda^{\top}Y_n \xrightarrow{\mathcal{D}} \lambda^{\top}Y$. The distribution of $\lambda^{\top}Y$ is $N(0, \lambda^{\top}\Sigma\lambda)$, and
$\lambda^{\top}Y_n$ is $\sqrt{n}$ times the average of the $\lambda^{\top}(X_n - \mu)$. By the univariate central limit
theorem B.97, $\lambda^{\top}Y_n \xrightarrow{\mathcal{D}} \lambda^{\top}Y$. □

There are inversion formulas for characteristic functions which allow us to
obtain or approximate the original distributions from the characteristic functions.

Example B.100 (Continuation of Example B.92; see page 640). Let $X$ have
distribution $N(0, \sigma^2)$. Then $\int |\phi_X(t)|\,dt < \infty$. In fact,

$$\frac{1}{2\pi}\int \exp(-itx)\phi_X(t)\,dt = \frac{1}{2\pi}\int \exp\left(-itx - \frac{\sigma^2t^2}{2}\right)dt = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right) = f_X(x).$$

Example B.100 says that the following inversion formula applies to normal
distributions with 0 mean. It is equally easy to see that it applies to $N_k(0, I_k)$
distributions. 44

Lemma B.101 (Continuous inversion formula). 45 Let $X \in \mathbb{R}^k$ have integrable
characteristic function. Then the distribution of $X$ has a bounded density
$f_X$ with respect to Lebesgue measure given by

$$f_X(x) = \frac{1}{(2\pi)^k}\int \exp(-it^{\top}x)\phi_X(t)\,dt. \tag{B.102}$$

43 This theorem is used in the proofs of Theorems 7.35 and 7.57.

44 We use the symbol $I_k$ to stand for the $k \times k$ identity matrix.

45 This lemma is used in the proofs of Lemma B.105 and Corollary B.106.

Proof. Clearly, the function in (B.102) is bounded since $\phi_X$ is integrable. Let $Y_\sigma$
have $N_k(0, \sigma^2I_k)$ distribution and be independent of $X$. The characteristic function
of $X + Y_\sigma$ is $\phi_X\phi_{Y_\sigma}$, and

$$\frac{1}{(2\pi)^k}\int \exp(-it^{\top}x)\phi_X(t)\phi_{Y_\sigma}(t)\,dt = \frac{1}{(2\pi)^k}\int\int \exp(-it^{\top}x)\exp(it^{\top}z)\phi_{Y_\sigma}(t)\,dF_X(z)\,dt \tag{B.103}$$

$$= \int f_{Y_\sigma}(x - z)\,dF_X(z) = f_{X+Y_\sigma}(x),$$

where the second equality follows from the fact that (B.102) applies to normal
distributions. Now suppose that we let $\sigma$ go to zero. Since $\phi_X$ is integrable and
$\phi_{Y_\sigma}(t)$ goes to 1 for all $t$, it follows that the left-hand side of (B.103) converges to
the right-hand side of (B.102). It also follows that $f_{X+Y_\sigma}$ is bounded uniformly
in $\sigma$ and $x$. Let $B$ be a hypercube such that the probability is 0 that $X$ is in the
boundary of $B$. Then

$$\int_B \lim_{\sigma\to 0} f_{X+Y_\sigma}(x)\,dx = \lim_{\sigma\to 0}\int_B f_{X+Y_\sigma}(x)\,dx = \int_B f_X(x)\,dx, \tag{B.104}$$

where the first equality follows from the boundedness of $f_{X+Y_\sigma}$, and the second
is proven as follows. The difference between $\int_B f_{X+Y_\sigma}(x)\,dx$ and $\int_B f_X(x)\,dx$ is
the sum over the $2^k$ corners of the hypercube $B$ of terms like

$$\sum_{i=1}^{k}\left[\Pr(b_i - Y_{\sigma,i} < X_i < b_i,\ Y_{\sigma,i} > 0) + \Pr(b_i < X_i < b_i - Y_{\sigma,i},\ Y_{\sigma,i} < 0)\right],$$

where $b_i$ is the $i$th coordinate of the corner. We can write

$$\Pr(b_i - Y_{\sigma,i} < X_i < b_i,\ Y_{\sigma,i} > 0) = \int_0^{\infty}\Pr(b_i - y < X_i < b_i)\,dF_{Y_{\sigma,i}}(y).$$

This last expression goes to 0 as $\sigma \to 0$ since $b_i$ is a continuity point for $F_{X_i}$.
A similar argument applies to the other probability. The equality of the first
and last expressions in (B.104) is what it means to say that $\lim_{\sigma\to 0} f_{X+Y_\sigma}(x)$
is the density of $X$ with respect to Lebesgue measure. This, in turn, equals the
right-hand side of (B.102). □
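The inversion formula (B.102) can be checked numerically in the case $k = 1$ for a $N(0, \sigma^2)$ distribution, where both the characteristic function and the density are known in closed form. The following is a sketch under stated assumptions: the function names, the truncation point `t_max`, and the step count are illustrative choices.

```python
import cmath
import math


def normal_cf(t, sigma):
    """Characteristic function of N(0, sigma^2)."""
    return math.exp(-0.5 * (sigma * t) ** 2)


def normal_density(x, sigma):
    """Density of N(0, sigma^2)."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def invert_cf(x, cf, t_max=40.0, steps=40000):
    """Approximate (1 / (2 pi)) * integral of exp(-i t x) cf(t) dt over
    [-t_max, t_max] by the trapezoidal rule (the k = 1 case of (B.102))."""
    h = 2.0 * t_max / steps
    total = 0.5 * (cmath.exp(1j * t_max * x) * cf(-t_max)
                   + cmath.exp(-1j * t_max * x) * cf(t_max))
    for j in range(1, steps):
        t = -t_max + j * h
        total += cmath.exp(-1j * t * x) * cf(t)
    return (total * h).real / (2.0 * math.pi)
```

For example, `invert_cf(1.0, lambda t: normal_cf(t, 1.5))` should agree closely with `normal_density(1.0, 1.5)`.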

Lemma B.105. 46 Let $Y$ be a random variable such that $\phi_Y$ is integrable. Let $X$
be an arbitrary random variable independent of $Y$. For all finite $a < b$ and $c$,

$$\Pr(a < X + cY \le b) = \frac{1}{2\pi}\int \frac{\exp(-ita) - \exp(-itb)}{it}\,\phi_X(t)\phi_Y(ct)\,dt.$$

Proof. Since $\phi_Y$ is integrable and $\phi_{X+cY}(t) = \phi_X(t)\phi_Y(ct)$, it follows that
$X + cY$ has integrable characteristic function. Lemma B.101 says that (B.102)
applies to $X + cY$, hence

$$f_{X+cY}(x) = \frac{1}{2\pi}\int \phi_X(t)\phi_Y(ct)\exp(-itx)\,dt,$$

$$\Pr(a < X + cY \le b) = \int_a^b f_{X+cY}(x)\,dx = \frac{1}{2\pi}\int_a^b\int \phi_X(t)\phi_Y(ct)\exp(-itx)\,dt\,dx$$

$$= \frac{1}{2\pi}\int \phi_Y(ct)\phi_X(t)\int_a^b \exp(-itx)\,dx\,dt = \frac{1}{2\pi}\int \phi_Y(ct)\phi_X(t)\,\frac{\exp(-ita) - \exp(-itb)}{it}\,dt. \qquad \Box$$

46 This lemma is used in the proof of Corollary B.106.



Corollary B.106 (Uniqueness theorem). 47 Let $F$ and $G$ be two univariate
CDFs such that $\phi_F = \phi_G$. Then $F = G$.

PROOF. In the proof of Lemma B.101, we proved that if $Y \sim N(0, 1)$, and if $a$ and
$b$ are continuity points of $F$, and $X$ has CDF $F$, then $\lim_{c\to 0}\Pr(a < X + cY \le
b) = \Pr(a < X \le b)$. The same is true of $G$. Hence, $F = G$ by Lemma B.105. □

An obvious consequence of the uniqueness theorem is the following.

Corollary B.107. 48 Suppose that $F$ and $G$ are $k$-dimensional CDFs such that
for every bounded continuous $f$, $\int f(x)\,dF(x) = \int f(x)\,dG(x)$. Then $F = G$.



B.5 Stochastic Processes 
B.5.1 Introduction 

Sometimes we wish to specify a joint distribution for an infinite sequence of
random variables. Let $(S, \mathcal{A}, \mu)$ be a probability space. If $X_n : S \to \mathbb{R}$ for every
$n$ and each $X_n$ is measurable with respect to the Borel σ-field $\mathcal{B}$, we can define
a σ-field of subsets of $\mathbb{R}^{\infty}$ such that the infinite sequence $X = (X_1, X_2, \ldots)$ is
measurable. Let $\mathcal{B}^{\infty}$ be the smallest σ-field that contains all finite-dimensional
orthants, that is, every set $B$ of the form

$$\{x : x_{i_1} \le c_1, \ldots, x_{i_n} \le c_n, \text{ for some } n \text{ and some integers } i_1, \ldots, i_n \text{ and some numbers } c_1, \ldots, c_n\}.$$

It is clear that $X^{-1}(B) \in \mathcal{A}$ since it is the intersection of finitely many sets in
$\mathcal{A}$. By Theorem A.34, it follows that $X^{-1}(\mathcal{B}^{\infty}) \subseteq \mathcal{A}$, so $X$ is measurable with
respect to this σ-field.

B.5.2 Martingales†

A particular type of stochastic process that is sometimes of interest is a martingale.
[For more discussion of martingales, see Doob (1953), Chapter VII.]

47 This corollary is used in the proof of Theorem 2.74.
48 This corollary is used in the proof of DeFinetti's representation theorem 1.49.
† This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.






Definition B.108. Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $\mathcal{N}$ be a set of consecutive
integers. For each $n \in \mathcal{N}$, let $\mathcal{F}_n$ be a sub-σ-field of $\mathcal{A}$ such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$
for all $n$ such that $n$ and $n + 1$ are in $\mathcal{N}$. Let $\{X_n\}_{n\in\mathcal{N}}$ be a sequence of random
variables such that $X_n$ is measurable with respect to $\mathcal{F}_n$ for all $n$. The
sequence of pairs $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is called a martingale if, for all $n$ such that $n$
and $n + 1$ are in $\mathcal{N}$, $\mathrm{E}(X_{n+1}|\mathcal{F}_n) = X_n$. It is called a submartingale if, for every
$n$, $\mathrm{E}(X_{n+1}|\mathcal{F}_n) \ge X_n$.

Note that a martingale is also a submartingale. 

Example B.109. A simple example of a martingale is the following. Let $\mathcal{N} =
\{1, 2, \ldots\}$, and let $\{Y_n\}_{n=1}^{\infty}$ be independent random variables with mean 0. Let
$X_n = \sum_{i=1}^{n} Y_i$. Let $\mathcal{F}_n$ be the σ-field generated by $Y_1, \ldots, Y_n$. Then,

$$\mathrm{E}(X_{n+1}|\mathcal{F}_n) = \mathrm{E}(Y_1 + \cdots + Y_{n+1}\,|\,\mathcal{F}_n) = Y_1 + \cdots + Y_n = X_n,$$

since $\mathrm{E}(Y_{n+1}|\mathcal{F}_n) = 0$ by independence. If each $Y_i$ has nonnegative finite mean,
then $\mathrm{E}(X_{n+1}|\mathcal{F}_n) \ge X_n$, and we have a submartingale.
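The defining property in Example B.109 can be verified exactly for a small discrete case. The sketch below assumes independent steps taking finitely many values with known probabilities and enumerates every possible history $Y_1, \ldots, Y_n$; the function names and the step distributions used in the usage note are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def conditional_mean_next(prefix, step_dist):
    """E(X_{n+1} | Y_1..Y_n = prefix) for X_n = Y_1 + ... + Y_n, where
    step_dist is a list of (value, probability) pairs for an independent
    next step Y_{n+1}."""
    x_n = sum(prefix)
    return sum(p * (x_n + y) for y, p in step_dist)


def drift_range(step_dist, n):
    """Smallest and largest value of E(X_{n+1} | history) - X_n over all
    histories of length n. (0, 0) means the martingale property holds
    exactly; a nonnegative minimum means a submartingale."""
    values = [y for y, _ in step_dist]
    gaps = [conditional_mean_next(prefix, step_dist) - sum(prefix)
            for prefix in product(values, repeat=n)]
    return min(gaps), max(gaps)
```

With fair steps `[(-1, Fraction(1, 2)), (1, Fraction(1, 2))]` the drift is identically 0 (a martingale); with steps `[(-1, Fraction(1, 4)), (1, Fraction(3, 4))]` the drift is 1/2 on every history (a submartingale).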

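Example B.110. Another example of a martingale is the following. Let $\mathcal{N}$ be a
collection of consecutive integers, and let $\{\mathcal{F}_n\}_{n\in\mathcal{N}}$ be an increasing sequence of
σ-fields. Let $X$ be a random variable with $\mathrm{E}(|X|) < \infty$. Set $X_n = \mathrm{E}(X|\mathcal{F}_n)$. By
the law of total probability B.70,

$$\mathrm{E}(X_{n+1}|\mathcal{F}_n) = \mathrm{E}[\mathrm{E}(X|\mathcal{F}_{n+1})|\mathcal{F}_n] = \mathrm{E}(X|\mathcal{F}_n) = X_n,$$

so $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is a martingale.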

Example B.111. If $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is a martingale, then

$$|X_n| = |\mathrm{E}(X_{n+1}|\mathcal{F}_n)| \le \mathrm{E}(|X_{n+1}|\,|\,\mathcal{F}_n), \tag{B.112}$$

hence $\{(|X_n|, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is a submartingale.

The following result is proven using the same argument as in Example B.111.

Proposition B.113. 49 If $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is a martingale, then $\mathrm{E}|X_n|$ is nondecreasing
in $n$.

The reader should note that if $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{N}}$ is a submartingale and if $\mathcal{M} \subseteq
\mathcal{N}$ is a string of consecutive integers, then $\{(X_n, \mathcal{F}_n)\}_{n\in\mathcal{M}}$ is also a submartingale.
Similarly, if $k$ is an integer (positive or negative) and $\mathcal{M} = \{n : n + k \in \mathcal{N}\}$, then
$\{(X'_n, \mathcal{F}'_n)\}_{n\in\mathcal{M}}$ is a submartingale, where $X'_n = X_{n+k}$ and $\mathcal{F}'_n = \mathcal{F}_{n+k}$. This
latter is just a shifting of the index set.

There are important convergence theorems that apply to many martingales
and submartingales. They say that if the set $\mathcal{N}$ is infinite, then limit random
variables exist. A lemma is needed to prove these theorems. 50 It puts a bound on
how often a submartingale can cross an interval between two numbers. It is used
to show that such crossings cannot occur infinitely often with high probability.
(Infinitely many crossings of a nondegenerate interval would imply divergence of
the submartingale.)



49 This proposition is used in the proof of Theorem B.122. 
50 This lemma is proven by Doob (1953, Theorem VII, 3.3). 






Lemma B.114 (Upcrossing lemma). 51 Let $\mathcal{N} = \{1, \ldots, N\}$, and suppose
that $\{(X_n, \mathcal{F}_n)\}_{n=1}^{N}$ is a submartingale. Let $r < q$, and define $V$ to be the number
of times that the sequence $X_1, \ldots, X_N$ crosses from below $r$ to above $q$. Then

$$\mathrm{E}(V) \le \frac{1}{q - r}\left(\mathrm{E}|X_N| + |r|\right). \tag{B.115}$$

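Proof. Let $Y_n = \max\{0, X_n - r\}$ for every $n$. Since $g(x) = \max\{0, x\}$ is a
nondecreasing convex function of $x$, it is easy to see (using Jensen's inequality B.17)
that $\{(Y_n, \mathcal{F}_n)\}_{n=1}^{N}$ is a submartingale. Note that a consecutive set of $X_i(s)$ cross
from below $r$ to above $q$ if and only if the corresponding consecutive set of $Y_i(s)$
cross from 0 to above $q - r$. Let $T_0(s) = 0$ and define $T_m$ for $m = 1, 2, \ldots$ as

$$T_m(s) = \inf\{k \le N : k > T_{m-1}(s),\ Y_k(s) = 0\}, \quad \text{if } m \text{ is odd},$$
$$T_m(s) = \inf\{k \le N : k > T_{m-1}(s),\ Y_k(s) \ge q - r\}, \quad \text{if } m \text{ is even},$$
$$T_m(s) = N + 1, \quad \text{if the corresponding set above is empty}.$$

Now $V(s)$ is one-half of the largest even $m$ such that $T_m(s) \le N$. Define, for
$i = 1, \ldots, N$,

$$R_i(s) = \begin{cases} 1 & \text{if } T_m(s) < i \le T_{m+1}(s) \text{ for } m \text{ odd}, \\ 0 & \text{otherwise.} \end{cases}$$

Then $(q - r)V(s) \le \sum_{i=1}^{N} R_i(s)(Y_i(s) - Y_{i-1}(s)) = X$, where $Y_0 = 0$ for convenience.
First, note that for all $m$ and $i$, $\{T_m(s) \le i\} \in \mathcal{F}_i$. Next, note that for
every $i$,

$$\{s : R_i(s) = 1\} = \bigcup_{m \text{ odd}}\left(\{T_m \le i - 1\} \cap \{T_{m+1} \le i - 1\}^c\right) \in \mathcal{F}_{i-1}. \tag{B.116}$$

It follows that

$$\mathrm{E}(X) = \sum_{i=1}^{N}\int_{\{s : R_i(s) = 1\}}(Y_i(s) - Y_{i-1}(s))\,d\mu(s) = \sum_{i=1}^{N}\int_{\{s : R_i(s) = 1\}}\left(\mathrm{E}(Y_i|\mathcal{F}_{i-1})(s) - Y_{i-1}(s)\right)d\mu(s)$$

$$\le \sum_{i=1}^{N}\int\left(\mathrm{E}(Y_i|\mathcal{F}_{i-1})(s) - Y_{i-1}(s)\right)d\mu(s) = \sum_{i=1}^{N}\left(\mathrm{E}(Y_i) - \mathrm{E}(Y_{i-1})\right) = \mathrm{E}(Y_N),$$

where the second equality follows from (B.116) and the inequality follows from
the fact that $\{(Y_n, \mathcal{F}_n)\}_{n=1}^{N}$ is a submartingale. It follows that $(q - r)\mathrm{E}(V) \le \mathrm{E}(Y_N)$.
Since $\mathrm{E}(Y_N) \le |r| + \mathrm{E}(|X_N|)$, it follows that (B.115) holds. □

The proof of the following convergence theorem is adapted from Chow, Robbins,
and Siegmund (1971).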
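The bound (B.115) can be checked exactly on a small case. The sketch below assumes a symmetric $\pm 1$ random walk (a martingale, hence a submartingale), counts upcrossings using the convention from the proof (a crossing starts once the path is at or below $r$ and ends once it is at or above $q$), and averages over all $2^N$ equally likely paths. Function names and the choice of walk are illustrative.

```python
from fractions import Fraction
from itertools import product


def count_upcrossings(path, r, q):
    """Number of times path moves from a value <= r to a later value >= q."""
    count = 0
    below = False  # True once the path has reached r or lower
    for x in path:
        if not below and x <= r:
            below = True
        elif below and x >= q:
            count += 1
            below = False
    return count


def upcrossing_check(n_steps, r, q):
    """Exact E(V) and the bound (E|X_N| + |r|) / (q - r) for the symmetric
    +/-1 random walk X_n = Y_1 + ... + Y_n, n = 1..n_steps."""
    total_v = 0
    total_abs = 0
    for steps in product((-1, 1), repeat=n_steps):
        path, x = [], 0
        for y in steps:
            x += y
            path.append(x)
        total_v += count_upcrossings(path, r, q)
        total_abs += abs(path[-1])
    n_paths = 2 ** n_steps
    expected_v = Fraction(total_v, n_paths)
    bound = (Fraction(total_abs, n_paths) + abs(r)) / (q - r)
    return expected_v, bound
```

For instance, `upcrossing_check(10, -1, 1)` returns the exact expected number of upcrossings of $[-1, 1]$ together with the right-hand side of (B.115), and the first should not exceed the second.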



51 This lemma is used in the proofs of Theorems B.117 and B.122.



Theorem B.117 (Martingale convergence theorem: part I). 52 Suppose
that $\{(X_n, \mathcal{F}_n)\}_{n=1}^{\infty}$ is a submartingale such that $\sup_n \mathrm{E}|X_n| < \infty$. Then $X =
\lim_{n\to\infty} X_n$ exists a.s. and $\mathrm{E}|X| < \infty$.

Proof. Let $X^* = \limsup_{n\to\infty} X_n$ and $X_* = \liminf_{n\to\infty} X_n$. Let $B = \{s :
X_*(s) < X^*(s)\}$. We will prove that $\mu(B) = 0$. We can write

$$B = \bigcup_{r < q,\ r, q \text{ rational}}\{s : X^*(s) > q > r > X_*(s)\}.$$

Now, $X^*(s) > q > r > X_*(s)$ if and only if the values of $X_n(s)$ cross from being
below $r$ to being above $q$ infinitely often. For fixed $r$ and $q$, we now prove that
this has probability 0; hence $\mu(B) = 0$. Let $V_n$ equal the number of times that
$X_1, \ldots, X_n$ cross from below $r$ to above $q$. According to Lemma B.114,

$$\sup_n \mathrm{E}(V_n) \le \frac{1}{q - r}\left(\sup_n \mathrm{E}(|X_n|) + |r|\right) < \infty.$$

The number of times the values of $\{X_n(s)\}_{n=1}^{\infty}$ cross from below $r$ to above $q$
equals $\lim_{n\to\infty} V_n(s)$. By the monotone convergence theorem A.52,

$$\infty > \sup_n \mathrm{E}(V_n) = \mathrm{E}\left(\lim_{n\to\infty} V_n\right).$$

It follows that $\mu(\{s : \lim_{n\to\infty} V_n(s) = \infty\}) = 0$.

Since $\mu(B) = 0$, we have that $X = \lim_{n\to\infty} X_n$ exists a.s. Fatou's lemma A.50
says $\mathrm{E}(|X|) \le \liminf_{n\to\infty}\mathrm{E}(|X_n|) \le \sup_n \mathrm{E}(|X_n|) < \infty$. □

For the particular martingale in which $X_n = \mathrm{E}(X|\mathcal{F}_n)$ for a single $X$, we have
an expression for the limit.

Theorem B.118 (Levy's theorem: part I). 53 Let $\{\mathcal{F}_n\}_{n=1}^{\infty}$ be an increasing
sequence of σ-fields. Let $\mathcal{F}_{\infty}$ be the smallest σ-field containing all of the $\mathcal{F}_n$. Let
$\mathrm{E}(|X|) < \infty$. Define $X_n = \mathrm{E}(X|\mathcal{F}_n)$ and $X_{\infty} = \mathrm{E}(X|\mathcal{F}_{\infty})$. Then $\lim_{n\to\infty} X_n =
X_{\infty}$, a.s.
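A concrete instance of Theorem B.118 may help fix ideas. The sketch assumes $U$ is uniform on $[0, 1)$, $X = U^2$, and $\mathcal{F}_n$ is generated by the first $n$ binary digits of $U$, so that $X_n = \mathrm{E}(X|\mathcal{F}_n)$ is the average of $u^2$ over the dyadic interval of width $2^{-n}$ containing $U$. Here $\mathcal{F}_{\infty}$ is the Borel σ-field, so $X_{\infty} = X$. The setup and function name are illustrative assumptions.

```python
from fractions import Fraction


def cond_mean_square(u, n):
    """E(U^2 | first n binary digits of U) evaluated at U = u, for U
    uniform on [0, 1). This is the average of x^2 over the dyadic
    interval [a, b) of width 2**-n that contains u."""
    width = Fraction(1, 2 ** n)
    a = (u // width) * width  # left endpoint of the dyadic interval
    b = a + width
    return (b ** 3 - a ** 3) / (3 * (b - a))
```

Evaluating `cond_mean_square(Fraction(1, 3), n)` for increasing `n` gives values approaching $(1/3)^2$, and averaging the two level-$(n+1)$ values inside one level-$n$ interval reproduces the level-$n$ value, which is the martingale property of Example B.110.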

The proof of this theorem requires a lemma that will also be needed later. 

Lemma B.119. 54 Let $\{\mathcal{F}_n\}_{n=1}^{\infty}$ be a sequence of σ-fields. Let $\mathrm{E}(|X|) < \infty$. Define
$X_n = \mathrm{E}(X|\mathcal{F}_n)$. Then $\{X_n\}_{n=1}^{\infty}$ is a uniformly integrable sequence.

PROOF. Since $\mathrm{E}(X|\mathcal{F}_n) = \mathrm{E}(X^+|\mathcal{F}_n) - \mathrm{E}(X^-|\mathcal{F}_n)$, and the sum of uniformly
integrable sequences is uniformly integrable, we will prove the result for nonnegative
$X$. Let $A_{c,n} = \{X_n \ge c\} \in \mathcal{F}_n$. So $\int_{A_{c,n}} X_n(s)\,d\mu(s) = \int_{A_{c,n}} X(s)\,d\mu(s)$. If
we can find, for every $\epsilon > 0$, a $C$ such that $\int_{A_{c,n}} X(s)\,d\mu(s) < \epsilon$ for all $n$ and all
$c \ge C$, we are done. Define $\eta(A) = \int_A X(s)\,d\mu(s)$. We have $\eta \ll \mu$ and $\eta$ is finite.
By Lemma A.72, we have that for every $\epsilon > 0$ there exists $\delta$ such that $\mu(A) < \delta$
implies $\eta(A) < \epsilon$. By the Markov inequality B.15,

$$\mu(A_{c,n}) \le \frac{1}{c}\mathrm{E}(X_n) = \frac{1}{c}\mathrm{E}(X)$$

for all $n$. Let $C = 2\mathrm{E}(X)/\delta$. Then $c \ge C$ implies $\mu(A_{c,n}) < \delta$ for all $n$, so

$$\eta(A_{c,n}) < \epsilon \quad \text{for all } n. \qquad \Box$$

52 This theorem is used in the proof of Theorems B.118 and 1.121.
53 This theorem is used in the proofs of Theorem 7.78 and Lemma 7.124.
54 This lemma is used in the proofs of Theorems B.118, B.122, and B.124. It is
borrowed from Billingsley (1986, Lemma 35.2).

Proof of Theorem B.118. By Lemma B.119, $\{X_n\}_{n=1}^{\infty}$ is a uniformly integrable
sequence. Let $Y$ be the limit of the martingale guaranteed by Theorem B.117.
Since $Y$ is a limit of functions of the $X_n$, it is measurable with respect to $\mathcal{F}_{\infty}$. It
follows from Theorem A.60 that for every event $A$, $\lim_{n\to\infty}\mathrm{E}(X_nI_A) = \mathrm{E}(YI_A)$.
Next, note that, for every $A \in \mathcal{F}_n$,

$$\int_A Y(s)\,d\mu(s) = \lim_{n\to\infty}\int_A \mathrm{E}(X|\mathcal{F}_n)(s)\,d\mu(s) = \int_A X(s)\,d\mu(s),$$

where the last equality follows from the definition of conditional expectation.
Since this is true for every $n$ and every $A \in \mathcal{F}_n$, it is true for all $A$ in the field
$\mathcal{F} = \bigcup_{n=1}^{\infty}\mathcal{F}_n$. Since $|X|$ is integrable, we can apply Theorem A.26 to conclude
that the equality holds for all $A \in \mathcal{F}_{\infty}$, the smallest σ-field containing $\mathcal{F}$. The
equality $\mathrm{E}(XI_A) = \mathrm{E}(YI_A)$ for all $A \in \mathcal{F}_{\infty}$ together with the fact that $Y$ is $\mathcal{F}_{\infty}$
measurable is precisely what it means to say that $Y = \mathrm{E}(X|\mathcal{F}_{\infty}) = X_{\infty}$. □

For negatively indexed martingales, there is also a convergence theorem. Some
authors refer to negatively indexed martingales in a different fashion, which is
often more convenient.

Definition B.120. Let $(S, \mathcal{A}, \mu)$ be a probability space. For each $n = 1, 2, \ldots$,
let $\mathcal{F}_n$ be a sub-σ-field of $\mathcal{A}$ such that $\mathcal{F}_{n+1} \subseteq \mathcal{F}_n$ for all $n$. Let $\{X_n\}_{n=1}^{\infty}$ be a
sequence of random variables such that $X_n$ is measurable with respect to $\mathcal{F}_n$ for
all $n$. The sequence of pairs $\{(X_n, \mathcal{F}_n)\}_{n=1}^{\infty}$ is called a reversed martingale if for
all $n$, $\mathrm{E}(X_n|\mathcal{F}_{n+1}) = X_{n+1}$.

Example B.121. As in Example B.110, we can let $\{\mathcal{F}_n\}_{n=1}^{\infty}$ be a decreasing
sequence of σ-fields, and let $\mathrm{E}(|X|) < \infty$. Define $X_n = \mathrm{E}(X|\mathcal{F}_n)$. It follows from
the law of total probability B.70 that $\{(X_n, \mathcal{F}_n)\}_{n=1}^{\infty}$ is a reversed martingale.

The following theorem is proven by Doob (1953, Theorem VII 4.2).

Theorem B.122 (Martingale convergence theorem: part II). 55 Suppose
that $\{(X_n, \mathcal{F}_n)\}_{n<0}$ is a martingale. Then $X = \lim_{n\to-\infty} X_n$ exists a.s. and has
finite mean.

55 This theorem is used in the proof of Theorem B.124.

PROOF. Just as in the proof of Theorem B.117, we let $V_n$ be the number of times
that the finite sequence $X_n, X_{n+1}, \ldots, X_{-1}$ crosses from below a rational $r$ to
above another rational $q$ (for $n < 0$). The upcrossing lemma B.114 says that

$$\mathrm{E}(V_n) \le \frac{1}{q - r}\left(\mathrm{E}(|X_{-1}|) + |r|\right) < \infty.$$



As in the proof of Theorem B.117, it follows that $X = \lim_{n\to-\infty} X_n$ exists with
probability 1. From (B.112) and Lemma B.119, it follows that

$$\lim_{n\to-\infty}\mathrm{E}(|X_n|) = \mathrm{E}(|X|).$$

By Proposition B.113, it follows that $\mathrm{E}(|X|) < \infty$, and so $X$ has finite mean. □

It is usually more convenient to express Theorem B.122 in terms of reversed
martingales.

Corollary B.123. 56 If $\{(X_n, \mathcal{F}_n)\}_{n=1}^{\infty}$ is a reversed martingale, then $\lim_{n\to\infty} X_n$
exists a.s. and has finite mean.

There is also a version of Levy's theorem B.118 for reversed martingales.

Theorem B.124 (Levy's theorem: part II). 57 Let $\{\mathcal{F}_n\}_{n=1}^{\infty}$ be a decreasing
sequence of σ-fields. Let $\mathcal{F}_{\infty}$ be the intersection $\bigcap_{n=1}^{\infty}\mathcal{F}_n$. Let $\mathrm{E}(|X|) < \infty$. Define
$X_n = \mathrm{E}(X|\mathcal{F}_n)$ and $X_{\infty} = \mathrm{E}(X|\mathcal{F}_{\infty})$. Then $\lim_{n\to\infty} X_n = X_{\infty}$ a.s.

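Proof. It is easy to see that $\{(X_n, \mathcal{F}_n)\}_{n=1}^{\infty}$ is a reversed martingale and that
$\mathrm{E}(|X_1|) < \infty$. By Theorem B.122, it follows that $\lim_{n\to\infty} X_n = Y$ exists and is
finite a.s. To prove that $Y = X_{\infty}$ a.s., note that $X_{\infty} = \mathrm{E}(X_1|\mathcal{F}_{\infty})$ since $\mathcal{F}_{\infty} \subseteq \mathcal{F}_1$.
So, we must show that $Y = \mathrm{E}(X_1|\mathcal{F}_{\infty})$. Let $A \in \mathcal{F}_{\infty}$. Then

$$\int_A X_n(s)\,d\mu(s) = \int_A X_1(s)\,d\mu(s),$$

since $A \in \mathcal{F}_n$ and $X_n = \mathrm{E}(X_1|\mathcal{F}_n)$. Once again, using (B.112) and Lemma B.119,
it follows that $\int_A Y(s)\,d\mu(s) = \int_A X_1(s)\,d\mu(s)$; hence $Y = \mathrm{E}(X_1|\mathcal{F}_{\infty})$. □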

B.5.3 Markov Chains* 

Another type of stochastic process we will occasionally meet is a Markov chain. 58 

Definition B.125. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables taking values
in a space $\mathcal{X}$ with σ-field $\mathcal{B}$. The sequence is called a Markov chain (with
stationary transition distributions) 59 if there exists a function $p : \mathcal{B} \times \mathcal{X} \to [0, 1]$
such that

• for all $x \in \mathcal{X}$, $p(\cdot, x)$ is a probability measure on $\mathcal{B}$;

• for all $B \in \mathcal{B}$, $p(B, \cdot)$ is $\mathcal{B}$ measurable;

• for each $n$ and each $B \in \mathcal{B}$,

$$p(B, x) = \Pr(X_{n+1} \in B\,|\,X_1 = x_1, X_2 = x_2, \ldots, X_{n-1} = x_{n-1}, X_n = x),$$

almost surely with respect to the joint distribution of $(X_1, \ldots, X_n)$.

56 This corollary is used in the proof of Theorem B.124.

57 This theorem is used in the proofs of Theorem 1.62, Corollary 1.63,
Lemma 2.121, and Lemma 7.83.

* This section may be skipped without interrupting the flow of ideas.

58 In this text, we only use Markov chains as occasional examples of sequences
of random variables that are not exchangeable.

59 There are more general definitions of Markov chains and Markov processes
in which the transition distribution from $X_n$ to $X_{n+1}$ is allowed to depend on $n$.
We will not need these more general processes in this book.

The last condition in the definition of a Markov chain says that the conditional
distribution of $X_{n+1}$ given the past depends only on the most recent past $X_n$. In
other words, $X_{n+1}$ is conditionally independent of $X_1, \ldots, X_{n-1}$ given $X_n$.

Example B.126. A sequence $\{X_n\}_{n=1}^{\infty}$ of IID random variables is a Markov
chain with $p(B, x) = \Pr(X_1 \in B)$ for all $x$.

Example B.127. Let $\{X_n\}_{n=1}^{\infty}$ be Bernoulli random variables such that

$$\Pr(X_{n+1} = 1\,|\,X_1 = x_1, \ldots, X_n = x_n) = p_{x_n,1},$$

for $x_n \in \{0, 1\}$. The entire joint distribution of the sequence is determined by the
numbers $p_{0,1}$, $p_{1,1}$, and $\Pr(X_1 = 1)$.
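To make the last sentence of Example B.127 concrete, the following sketch computes the joint probability of any finite 0/1 sequence from the three numbers $p_{0,1}$, $p_{1,1}$, and $\Pr(X_1 = 1)$. The function name, argument names, and the numerical values in the usage note are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def joint_prob(seq, p1, p01, p11):
    """Pr(X_1 = seq[0], ..., X_n = seq[-1]) for the two-state chain of
    Example B.127, where p1 = Pr(X_1 = 1), p01 = Pr(next = 1 | current = 0)
    and p11 = Pr(next = 1 | current = 1)."""
    prob = p1 if seq[0] == 1 else 1 - p1
    for prev, cur in zip(seq, seq[1:]):
        p_one = p11 if prev == 1 else p01
        prob *= p_one if cur == 1 else 1 - p_one
    return prob


def total_mass(n, p1, p01, p11):
    """Sum of joint_prob over all 0/1 sequences of length n (should be 1)."""
    return sum(joint_prob(s, p1, p01, p11) for s in product((0, 1), repeat=n))
```

With, say, `p1 = Fraction(1, 2)`, `p01 = Fraction(1, 5)`, `p11 = Fraction(9, 10)`, the sequences `(1, 1, 0)` and `(1, 0, 1)` receive different probabilities, so such a chain need not be exchangeable.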

B.5.4 General Stochastic Processes 

Occasionally, we will have to deal with more complicated stochastic processes. 
What makes them more complicated is that they consist of more than countably 
many random quantities. 

Example B.128. Let $\mathcal{F}$ be a set of real-valued functions of a real vector. That
is, there exists $k$ such that $F \in \mathcal{F}$ means $F : \mathbb{R}^k \to \mathbb{R}$. Suppose that $X : S \to \mathcal{F}$
is a random quantity whose values are functions themselves. We would like to
be able to discuss the distribution of $X$. We will need a σ-field of subsets of $\mathcal{F}$
in order to discuss measurability. A natural σ-field is the smallest σ-field that
contains all sets of the form $A_{t,x} = \{F \in \mathcal{F} : F(t) \le x\}$, for all $t \in \mathbb{R}^k$ and
all $x \in \mathbb{R}$. It can be shown (see below) that $X$ is measurable with respect to
this σ-field if, for every $t \in \mathbb{R}^k$, the real-valued function $G_t : S \to \mathbb{R}$ is Borel
measurable, where $G_t(s) = F(t)$ when $X(s) = F$.

A general stochastic process can be defined, and it resembles the above example 
in all important aspects. 

Definition B.129. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let $R$ be some set.
For each $r \in R$, let $(\mathcal{X}_r, \mathcal{B}_r)$ be a Borel space, and let $X_r : S \to \mathcal{X}_r$ be measurable.
The collection of random variables $X = \{X_r : r \in R\}$ is called a stochastic
process.

Example B.130. If every $(\mathcal{X}_r, \mathcal{B}_r)$ is the same space $(\mathcal{X}, \mathcal{B})$, then $X$ can be
thought of as a "random function" from $R$ to $\mathcal{X}$ as follows. For each $s \in S$, define
the function $F_s : R \to \mathcal{X}$ by $F_s(r) = X_r(s)$. In order to make this a true random
function, we need a σ-field on the set of functions from $R$ to $\mathcal{X}$. Since this set of
functions is the product set $\mathcal{X}^R$, a natural σ-field is the product σ-field $\mathcal{B}^R$. The
product σ-field is easily seen to be the smallest σ-field containing all sets of the
form $A_{r,B} = \{F : F(r) \in B\}$, for $r \in R$ and $B \in \mathcal{B}$. Now, let $F : S \to \mathcal{X}^R$ be
defined by $F(s) = F_s$. Then $F$ is measurable because

$$F^{-1}(A_{r,B}) = \{s : F_s(r) \in B\} = \{s : X_r(s) \in B\} \in \mathcal{A},$$

because $X_r$ is measurable.






The important theorem about stochastic processes is that their distribution is 
determined by the joint distributions of all finite collections of the X r . 

Theorem B.131. 60 Let $R$ be a set and, for each $r \in R$, let $(\mathcal{X}_r, \mathcal{B}_r)$ be a Borel
space. Let $X = \{X_r : r \in R\}$ and $X' = \{X'_r : r \in R\}$ be two stochastic processes.
Suppose that for every $k$ and every $k$-tuple $(r_1, \ldots, r_k) \in R^k$, the joint distribution
of $(X_{r_1}, \ldots, X_{r_k})$ is the same as that of $(X'_{r_1}, \ldots, X'_{r_k})$. Then the distribution of
$X$ is the same as that of $X'$.

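Proof. Define $\mathcal{X} = \prod_{r\in R}\mathcal{X}_r$ and let $\mathcal{B}$ be the product σ-field. Say that a set
$C \in \mathcal{B}$ is a finite-dimensional cylinder set if there exists $k$ and $r_1, \ldots, r_k \in R$ and
a measurable $D \subseteq \prod_{i=1}^{k}\mathcal{X}_{r_i}$ such that

$$C = \{x \in \mathcal{X} : (x_{r_1}, \ldots, x_{r_k}) \in D\}.$$

It is easy to see that if $\{r_1, \ldots, r_k\} \subseteq \{s_1, \ldots, s_m\}$ for $m > k$, then there exists a
measurable subset $D'$ of $\prod_{j=1}^{m}\mathcal{X}_{s_j}$ such that

$$C = \{x \in \mathcal{X} : (x_{s_1}, \ldots, x_{s_m}) \in D'\},$$

by taking the Cartesian product of $D$ times the product of those $\mathcal{X}_r$ for $r \in
\{s_1, \ldots, s_m\} \setminus \{r_1, \ldots, r_k\}$ and then possibly rearranging the coordinates of all
points in this set to match the order of $r_1, \ldots, r_k$ among $s_1, \ldots, s_m$. So, if $C$ and
$G$ are both finite-dimensional cylinder sets with

$$G = \{x \in \mathcal{X} : (x_{h_1}, \ldots, x_{h_\ell}) \in E\},$$

then we can let $\{t_1, \ldots, t_m\} = \{r_1, \ldots, r_k\} \cup \{h_1, \ldots, h_\ell\}$ and write

$$C = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in D'\}, \qquad G = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in E'\}.$$

It follows that

$$C \cap G = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in D' \cap E'\}.$$

So the finite-dimensional cylinder sets form a π-system. By assumption, the
distributions of $X$ and $X'$ agree on this π-system. Since $\mathcal{X} = \{x \in \mathcal{X} : x_r \in \mathcal{X}_r\}$ for
arbitrary $r \in R$ and since the distributions of $X$ and $X'$ are finite measures, we
can apply Theorem A.26 to conclude that the distributions are the same. □

Another important fact about general stochastic processes is that it is possible
to specify a joint distribution for the entire process by merely specifying all
of the finite-dimensional joint distributions, so long as they obey a consistency
condition.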
Another important fact about general stochastic processes is that it is possible 
to specify a joint distribution for the entire process by merely specifying all 
of the finite-dimensional joint distributions, so long as they obey a consistency 
condition. 

Definition B.132. Let $\mathcal{X} = \prod_{r\in R}\mathcal{X}_r$ with the product σ-field, where $(\mathcal{X}_r, \mathcal{B}_r)$ is
a Borel space for every $r$. For each finite $k$ and each $k$-tuple $(i_1, \ldots, i_k)$ of distinct
elements of $R$, let $P_{i_1,\ldots,i_k}$ be a probability measure on $\prod_{j=1}^{k}\mathcal{X}_{i_j}$. We say that
these probabilities are consistent if the following conditions hold for each $k$ and
distinct $i_1, \ldots, i_k \in R$ and each $A$ in the product σ-field of $\prod_{j=1}^{k}\mathcal{X}_{i_j}$:

• For each permutation $\pi$ of $k$ items, $P_{i_1,\ldots,i_k}(A) = P_{i_{\pi(1)},\ldots,i_{\pi(k)}}(B)$, where

$$B = \{(x_{\pi(1)}, \ldots, x_{\pi(k)}) : (x_1, \ldots, x_k) \in A\}.$$

• For each $\ell \in R \setminus \{i_1, \ldots, i_k\}$, $P_{i_1,\ldots,i_k}(A) = P_{i_1,\ldots,i_k,\ell}(B)$, where

$$B = \{(x_1, \ldots, x_k, x_{k+1}) : (x_1, \ldots, x_k) \in A,\ x_{k+1} \in \mathcal{X}_\ell\}.$$

60 This theorem is used in the proofs of Theorem B.133 and DeFinetti's
representation theorem 1.49.

Since the set $R$ may not be ordered, the first condition ensures that it does not
matter in what order one writes a finite set of indices. The second condition is
the substantive one, and it ensures that the marginal distributions of subsets of
coordinates are the probability measures associated with those subsets.

To avoid excessive notation, it will be convenient to refer to Pj as the proba- 
bility measure associated with a finite subset J C R without specifying the order 
of the elements of J. When the consistency conditions in Definition B.132 hold, 
this should not cause any confusion. 

The proof of the following theorem is adapted from Loeve (1977, pp. 94-5). 
The theorem says that consistent finite-dimensional distributions determine a 
unique joint distribution on the product space. 

Theorem B.133. 61 Let $\mathcal{X} = \prod_{r\in R}\mathcal{X}_r$ with the product σ-field, where $\mathcal{X}_r$ is a
Borel space for every $r$. For each finite subset $J \subseteq R$, let $P_J$ be a probability
measure on $\prod_{r\in J}\mathcal{X}_r$. Suppose that the $P_J$ are consistent as defined in
Definition B.132. Then there exists a unique distribution on $\mathcal{X}$ with finite-dimensional
marginals given by the $P_J$.

Proof. The uniqueness follows from Theorem B.131, if we can prove existence.
First, suppose that $\mathcal{X}_r = \mathbb{R}$ for all $r$. Let $\mathcal{C}$ be the class of all unions of finitely
many finite-dimensional cylinder sets of the form $C = \prod_{r \in R} C_r$, where all but
finitely many of the $C_r$ equal $\mathbb{R}$ and the others are unions of finitely many
intervals. The class $\mathcal{C}$ is a field. For $C$ of the above form, define
$P(C) = P_J(\prod_{r \in J} C_r)$, where $J$ is the finite set of $r$ with $C_r \ne \mathbb{R}$.
The consistency assumption implies that $P$ can be uniquely extended to a finitely
additive probability on $\mathcal{C}$. To show that $P$ is countably additive, we will show
that if $\{A_n\}_{n=1}^\infty$ is a decreasing sequence of elements of $\mathcal{C}$ such that $P(A_n) > \epsilon$
for all $n$, then $A = \cap_{n=1}^\infty A_n$ is nonempty. Suppose that $P(A_n) > \epsilon$ for all $n$. Let
$J_n$ be the set of all subscripts involved in $A_1, \ldots, A_n$ and $J$ be the union of these
sets. Let $A_n = B_n \times \prod_{r \notin J_n} \mathcal{X}_r$. Then $P(A_n) = P_{J_n}(B_n)$, and $B_n$ is the union of
finitely many products of intervals. For each product of intervals $H$ that
constitutes $B_n$, we can find a product of bounded closed intervals contained in $H$ such
that the $P_{J_n}$ probability of the union of these products is as close as we wish to
$P_{J_n}(B_n)$. Let $C_n$ be a union of finitely many products of closed bounded intervals
contained in $B_n$ such that $P_{J_n}(B_n \setminus C_n) < \epsilon/2^{n+1}$. If $D_n$ is the cylinder set
corresponding to $C_n$, then

$$P(A_n \setminus D_n) = P_{J_n}(B_n \setminus C_n) < \frac{\epsilon}{2^{n+1}}.$$

Now, let $E_n = \cap_{i=1}^n D_i$, so that $P(A_n \setminus E_n) < \epsilon/2$. It follows that $P(E_n) > \epsilon/2$,
so each $E_n$ is nonempty. Let $x^n = (x^n_1, x^n_2, \ldots) \in E_n$. Since $E_1 \supseteq E_2 \supseteq \cdots$, it
follows that for every $k \ge 0$, $x^{n+k} \in E_n \subseteq D_n$. Hence $(x^{n+k}_i; i \in J_n) \in C_n$. Since

61 This theorem is used in the proof of Lemma 2.123. 






each $C_n$ is bounded, there is a subsequence of $\{(x^n_i; i \in J_1)\}_{n=1}^\infty$ that converges to
a point $(x_i; i \in J_1) \in C_1$. Let the subsequence be $\{(x^{n_k}_i; i \in J_1)\}_{k=1}^\infty$. Then there
is a subsequence of $\{(x^{n_k}_i; i \in J_2)\}_{k=1}^\infty$ that converges to a point $(x_i; i \in J_2) \in C_2$.
Continue extracting subsequences to get a limit point $x_J = (x_i; i \in J) \in D_n$ for
all $n$. Hence, every point that extends $x_J$ to an element of $\mathcal{X}$ is in $A_n$ for all
$n$, and $A$ is nonempty. Now apply the Caratheodory extension theorem A.22 to
extend $P$ to the entire product $\sigma$-field.

For general Borel spaces, let $\phi_r : \mathcal{X}_r \to F_r$ be a bimeasurable mapping to a
Borel subset of $\mathbb{R}$ for each $r$. It follows easily by using Theorem A.34 that the
function $\phi : \mathcal{X} \to \prod_{r \in R} F_r$ is bimeasurable, where $\phi(x) = (\phi_r(x_r); r \in R)$. For
each finite subset $J$, $\phi$ induces a probability on $\prod_{i \in J} \mathbb{R}$ from $P_J$, and these are
clearly consistent. By what we have already proven there is a probability $P$ on
$\prod_{r \in R} \mathbb{R}$ with the desired marginals. Then $\phi^{-1}$ induces a probability on $\mathcal{X}$ from
$P$ with the desired marginals. □



B.6 Subjective Probability 

It is not obvious for what purpose a mathematical probability, as described in 
this chapter and defined in Definition A.18, would ever be useful. In this section,
we try to show how the mathematical definition of probability is just what one 
would want to use to describe one's uncertainty about unknown quantities if one 
were forced to gamble on the outcomes of those unknown quantities. 62 

DeFinetti (1974) suggests that probability be defined in terms of those gam- 
bles an agent is willing to accept. Others, like DeGroot (1970), would only require 
that probabilities be subjective degrees of belief. Either way, we might ask, "Why 
should degrees of belief or gambling behavior satisfy the measure theoretic defini- 
tion of probability?" In this section, we will try to motivate the measure theoretic 
definition of probability by considering gambling behavior. We begin by adopting 
the viewpoint of DeFinetti (1974). 63 

For the purposes of this discussion, let a random variable be any number about 
which we are uncertain. For each bounded random variable X, assume that there 
is some fair price p such that an agent is indifferent between all gambles that pay 
c(X - p), where c is in some sufficiently small symmetric interval around 0 such 
that the maximum loss is still within the means of the agent to pay. For example, 
suppose that X = x is observed. If c(x - p) > 0, then the agent would receive 
this amount. If c(x - p) < 0, then the agent would lose -c(x - p). It must be 
that -c(x - p) is small enough for the agent to be able to pay. Surely, for x in a 



62 In Section 3.3, we give a much more elaborate motivation for the entire 
apparatus of Bayesian decision theory, which includes mathematical probability 
as one of its components. An alternative derivation of mathematical probability 
from operational considerations is given in Chapter 6 of DeGroot (1970). 

63 There are a few major differences between the approach in this section and 
DeFinetti's approach, which DeFinetti, were he alive, would be quick to point 
out. Out of respect for his memory and his followers, we will also try to point 
out these differences as we encounter them. 






bounded set, c can be made small enough for this to hold, so long as the agent 
has some funds available. 

Definition B.134. The fair price p of a random quantity is called its prevision 
and is denoted P(X). It is assumed, for a bounded random quantity X, that the 
agent is indifferent between all gambles whose net gain (loss if negative) to the 
agent is c(X - P(X)) for all c in some symmetric interval around 0. 

The symmetric interval around 0 mentioned in the definition of prevision may 
be different for different random variables. For example, it might stand to reason 
that the interval corresponding to the random variable 2X would be half as wide 
as the interval corresponding to X. 

Another assumption we make is that if an agent is willing to accept each of a 
countable collection of gambles, then the agent is willing to accept all of them 
at once, so long as the maximum possible loss is small enough for the agent to 
pay. 64 An example of countably many gambles, each of which is acceptable but 
cannot be accepted together, is the famous St. Petersburg paradox. 

Example B.135. Suppose that a fair coin is tossed until the first head appears.
Let $N$ be the number of tosses until the first head appears. For $n = 1, 2, \ldots$,
define

$$X_n = \begin{cases} 2^n & \text{if } N = n, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that our agent says that $P(X_n) = 1$ for all $n$. For each $n$, there is $c_n < 0$
such that the agent is willing to accept $c_n(X_n - 1)$. If $-c_n 2^n$ is too big,
however, the agent cannot accept all of the gambles at once. Similarly, there are
$c_n > 0$ such that the agent is willing to accept $c_n(X_n - 1)$. If $\sum_{n=1}^\infty c_n$ is too
big, the agent cannot accept all of these gambles. The St. Petersburg paradox
corresponds to the case in which $c_n = 1$ for all $n$. In this case, the agent pays $\infty$
and only receives $2^N$ in return. We have ruled out this possibility by requiring
that the agent be able to afford the worst possible loss.

The following example illustrates how it is possible to accept infinitely many 
gambles at once. 

Example B.136. Suppose that a random quantity $X$ could possibly be any one
of the positive integers. For each positive integer $x$, let

$$I_x = \begin{cases} 1 & \text{if } X = x, \\ 0 & \text{if not.} \end{cases}$$

Suppose that the agent is indifferent between all gambles of the form $c(I_x - 2^{-x})$
for all $-1 \le c \le 1$ and all integers $x$. Then, we assume that the agent is also
indifferent between all gambles of the form $\sum_{x=1}^\infty c_x(I_x - 2^{-x})$, so long as
$-1 \le c_x \le 1$ for all $x$. (Note that the largest possible loss is no more than 1.) Let
$Y = \sum_{x=1}^\infty c_x I_x$ with $-1 \le c_x \le 1$ for all $x$. Note that $Y$ is a bounded random

64 DeFinetti would not require an agent to accept countably many gambles at 
once but rather only finitely many. We introduce this stronger requirement to 
avoid mathematical problems that arise when the weaker assumption holds but 
the stronger one does not. Schervish, Seidenfeld, and Kadane (1984) describe one 
such problem in detail. 






quantity, and that the agent has implicitly agreed to accept all gambles of the
form $c(Y - \mu)$ for $-1 \le c \le 1$, where $\mu = \sum_{x=1}^\infty c_x 2^{-x}$. If the agent were
foolish enough to be indifferent between all gambles of the form $d(Y - p)$ for
$-a \le d \le a$ where $p \ne \mu$, then a clever opponent could make money with no risk.
For example, if $p > \mu$, let $f = \min\{1, a\}$. The opponent would ask the agent to
accept the gamble $f(Y - p)$ as well as the gambles $-fc_x(I_x - 2^{-x})$ for $x = 1, 2, \ldots$.
The net effect to the agent of these gambles is $-f(p - \mu) < 0$, no matter what
value $X$ takes! A similar situation arises if $p < \mu$. Only $p = \mu$ protects the agent
from this sort of problem, which is known as Dutch book.
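
The arithmetic behind the Dutch book in Example B.136 can be checked directly. The following sketch is illustrative only: it assumes a hypothetical choice of coefficients $c_x = (-1)^x$ for $x \le 10$ and $c_x = 0$ otherwise, a mispriced prevision $p = \mu + 0.25$, and $a = 0.5$; the names `fair_price` and `net_payoff` are invented for the example. It computes the agent's total gain for several values of $X$ and shows that it is always $-f(p - \mu)$.

```python
def fair_price(coeffs):
    """mu = sum_x c_x 2^{-x}, the price of Y implied by the accepted gambles."""
    return sum(c * 2.0 ** (-x) for x, c in coeffs.items())


def net_payoff(x_obs, coeffs, p, a):
    """Agent's total gain when X = x_obs and Y has been priced at p, not mu.

    coeffs maps x to c_x (with c_x = 0 for every x not listed), and a is the
    half-width of the interval of stakes the agent accepts on Y - p.
    """
    f = min(1.0, a)
    y = coeffs.get(x_obs, 0.0)              # Y = sum_x c_x I_x equals c_X
    gamble_on_y = f * (y - p)               # the gamble f(Y - p)
    side_gambles = sum(                     # the gambles -f c_x (I_x - 2^{-x})
        -f * c * ((1.0 if x == x_obs else 0.0) - 2.0 ** (-x))
        for x, c in coeffs.items()
    )
    return gamble_on_y + side_gambles


coeffs = {x: (-1.0) ** x for x in range(1, 11)}   # illustrative c_x in [-1, 1]
mu = fair_price(coeffs)
p = mu + 0.25                                     # an incoherent price, p > mu
payoffs = [net_payoff(x, coeffs, p, a=0.5) for x in range(1, 16)]
# Every entry equals -f (p - mu) = -0.125, up to floating-point round-off.
```

The loss does not depend on the observed value of $X$, which is exactly what makes the combination of gambles a sure loss.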

To avoid Dutch book, we introduce the following definition. 

Definition B.137. Let $\{X_\alpha : \alpha \in A\}$ be a collection of bounded random
variables. Suppose that, for each $\alpha$, an agent gives a prevision $P(X_\alpha)$ and is indifferent
between all gambles of the form $c(X_\alpha - P(X_\alpha))$ for $-d_\alpha \le c \le d_\alpha$ with

$$d_\alpha = \min\left\{\frac{M}{\max X_\alpha - P(X_\alpha)}, \frac{M}{P(X_\alpha) - \min X_\alpha}\right\}$$

for some $M > 0$. These previsions are coherent
if there exist no countable subset $B \subseteq A$ and $\{c_b : -d_b \le c_b \le d_b, \text{ for all } b \in B\}$
such that $-M \le \sum_{b \in B} c_b(X_b - P(X_b)) < 0$ under all circumstances. 65 If a
collection of previsions is not coherent, we say that it is incoherent.

The value M is the maximum amount the agent is willing to lose. Coherence of 
a sufficiently rich collection of previsions is equivalent to a probability assignment. 

Theorem B.138. 66 Let $(S, \mathcal{A})$ be a measurable space. Suppose that, for each
$C \in \mathcal{A}$, the agent assigns a prevision $P(I_C)$, where $I_C$ is the indicator of $C$.
Define $\mu : \mathcal{A} \to \mathbb{R}$ by $\mu(C) = P(I_C)$. Then the previsions are coherent if and
only if $\mu$ is a probability on $(S, \mathcal{A})$.

PROOF. Without loss of generality, suppose that the agent is indifferent between
all gambles of the form $c(I_C - P(I_C))$, for all $-1 \le c \le 1$. For the "if" part,
assume that $\mu$ is a probability. Let $\{C_n\}_{n=1}^\infty \in \mathcal{A}$ and $c_n \in [-1, 1]$ be such that
with

$$X = \sum_{n=1}^\infty c_n [I_{C_n} - \mu(C_n)],$$

the maximum losses from $X$ and from $-X$ are small enough for the agent to
afford. Since this makes $X$ bounded, it follows from Fubini's theorem A.70 that
$E(X) = 0$; hence it is impossible that $X < 0$ under all circumstances, and the
previsions are coherent.

For the "only if" part, assume that the previsions are coherent. Clearly, $\mu(\emptyset) = 0$,
since $I_\emptyset = 0$ and $-c\mu(\emptyset) \ge 0$ for both positive and negative $c$. It is also easy to
see that $\mu(A) \ge 0$ for all $A$. If $\mu(A) < 0$, then for all negative $c$, $c(I_A - \mu(A)) < 0$
and we have incoherence. Countable additivity follows in a similar fashion. Let
$\{A_n\}_{n=1}^\infty$ be mutually disjoint, and let $A = \cup_{n=1}^\infty A_n$. If $\mu(A) < \sum_{n=1}^\infty \mu(A_n)$,



65 When only finitely many gambles are required to be combined at once, as by 
DeFinetti (1974), incoherence requires that the sum be strictly less than some 
negative number under all circumstances. That is, DeFinetti would allow a strictly 
negative gamble to be called coherent, so long as the least upper bound was 0. 

66 This theorem is used in the proof of Theorem B.139. 






then the following gamble is always negative:

$$\sum_{n=1}^\infty [I_{A_n} - \mu(A_n)] - [I_A - \mu(A)] = \mu(A) - \sum_{n=1}^\infty \mu(A_n).$$

If $\mu(A) > \sum_{n=1}^\infty \mu(A_n)$, then the negative of the above gamble is always negative.
Either way there is incoherence. □

Theorem B.138 says that if an agent insists on dealing with a $\sigma$-field of
subsets of some set $S$, then expressing coherent previsions for gambles on events is
equivalent to choosing probabilities. 67 Similar claims can be made about bounded
random variables.

Theorem B.139. Let $\mathcal{C}$ be the collection of all bounded measurable functions
from a measurable space $(S, \mathcal{A})$ to $\mathbb{R}$. Suppose that, for each $X \in \mathcal{C}$, an agent
assigns a prevision $P(X)$. The previsions are coherent if and only if there exists
a probability $\mu$ on $(S, \mathcal{A})$ such that $P(X) = E(X)$ for all $X \in \mathcal{C}$.

Proof. Suppose that the agent is indifferent between all gambles of the form
$c(X - P(X))$ for $-d_X \le c \le d_X$. For the "if" direction, the proof is virtually
identical to the corresponding part of the proof of Theorem B.138. For the "only
if" part, note that $I_A \in \mathcal{C}$ for every $A \in \mathcal{A}$. It follows from Theorem B.138 that a
probability $\mu$ exists such that $\mu(A) = P(I_A)$ for all $A \in \mathcal{A}$. Hence $P(X) = E(X)$
for all simple functions $X$. Let $X \ge 0$ and let $X_1 \le X_2 \le \cdots$ be simple functions
less than or equal to $X$ such that $\lim_{n \to \infty} X_n = X$. Then, taking $X_0 = 0$,
$X = \sum_{n=0}^\infty (X_{n+1} - X_n)$, so

$$P(X) = \sum_{n=0}^\infty P(X_{n+1} - X_n) = \lim_{n \to \infty} E(X_{n+1}) = E(X),$$

from coherence and the monotone convergence theorem A.52. For general $X$,
let $X^+$ and $X^-$ be, respectively, the positive and negative parts of $X$. Since
$P(X) = P(X^+) - P(X^-)$ follows easily from coherence, the proof is complete. □
We conclude this "motivation" of probability theory from gambling considerations
by trying to motivate conditional probability. Suppose that, in addition to
assigning previsions to gambles involving arbitrary bounded random variables,
the agent is also required to assign conditional previsions in the following way.
Let $\mathcal{C}$ be a sub-$\sigma$-field of $\mathcal{A}$, and suppose that gambles of the form $cI_A(X - p)$, for
all nonempty $A \in \mathcal{C}$, are being considered. 68 The fair price would be that value
of $p$, denoted $P(X|A)$, such that the agent was indifferent between all gambles of
the form $cI_A(X - P(X|A))$ for all $c$ in some symmetric interval around 0. Rather
than choose a different $P(X|A)$ for each $A$, the agent has the option of choosing
a single function $Q : S \to \mathbb{R}$ such that $Q$ is measurable with respect to the $\sigma$-field
$\mathcal{C}$. The conditional gambles would then be $cI_A(X - Q)$.

Example B.140. For the simple case in which $\mathcal{C} = \{\emptyset, A, A^C, S\}$, $Q$ is measurable
if and only if it takes on only two values, one on $A$ and the other on $A^C$. In

67 In the theory of DeFinetti (1974), one obtains finitely additive probabilities
without assuming that probabilities have been assigned to all elements of a
$\sigma$-field.

68 DeFinetti (1974) would only require that such conditional gambles be
considered one at a time rather than a $\sigma$-field at a time.






this case, there are only two sets of conditional gambles (other than the
"unconditional" gambles $c[X - P(X)]$), namely $cI_A(X - P(X|A))$ and $cI_{A^C}(X - P(X|A^C))$.
Here, $Q = P(X|A)I_A + P(X|A^C)I_{A^C}$. Note that the previsions $P(XI_A)$ and
$P(I_A) = \mu(A)$ are already expressed. It is easy to see that

$$cI_A(X - P(X|A)) = c(XI_A - E(XI_A)) - cP(X|A)(I_A - \mu(A)) + c[E(XI_A) - P(X|A)\mu(A)].$$

The first two terms on the right are gambles at prices the agent has already
expressed, and the last term is a constant. Clearly, the only coherent choices of
$P(X|A)$ satisfy $P(X|A)\mu(A) = E(XI_A)$. If $\mu(A) > 0$, then $P(X|A) = E(XI_A)/\mu(A)$,
the usual conditional mean of $X$ given $A$. Similarly, $P(X|A^C)\mu(A^C) = E(XI_{A^C})$
must hold.

The general situation is not much different from Example B.140. 

Theorem B.141. Suppose that an agent must choose a function $Q$ that is
measurable with respect to a sub-$\sigma$-field $\mathcal{C}$ so that for each nonempty $A \in \mathcal{C}$, he or
she is indifferent between all gambles of the form $cI_A(X - Q)$. The choice of $Q$
is coherent if and only if $E(QI_A) = E(XI_A)$, for all $A \in \mathcal{C}$.

Proof. As in Example B.140, note that

$$cI_A(X - Q) = c(XI_A - E(XI_A)) - c(QI_A - E(QI_A)) + c[E(XI_A) - E(QI_A)].$$

The choice of $Q$ can be coherent if and only if $E(QI_A) = E(XI_A)$. □

The reader should note the similarity between the conditions in Theorem B.141
and Definition B.23. The function $Q$ must be a version of the conditional mean
of $X$ given $\mathcal{C}$.

Example B.142. Let $(X, Y)$ be random variables with a traditional joint density
$f_{X,Y}$ with respect to Lebesgue measure. That is, for all Borel sets $C \subseteq \mathbb{R}^2$,

$$\Pr((X, Y) \in C) = \int_C f_{X,Y}(x, y)\,dx\,dy,$$

and for all bounded measurable functions $g : \mathbb{R}^2 \to \mathbb{R}$,

$$E(g(X, Y)) = \int g(x, y) f_{X,Y}(x, y)\,dx\,dy. \qquad \text{(B.143)}$$

Let $\mathcal{C}$ be the $\sigma$-field generated by $Y$. That is, $\mathcal{C} = \{Y^{-1}(A) : A \in \mathcal{B}\}$, where $\mathcal{B}$ is
the Borel $\sigma$-field of subsets of $\mathbb{R}$. It is straightforward to check that for all $A \in \mathcal{C}$,
$E(XI_A) = E(QI_A)$, where $Q(s) = h(Y(s))$,

$$h(y) = \frac{\int x f_{X,Y}(x, y)\,dx}{f_Y(y)},$$

and $f_Y(y) = \int f_{X,Y}(x, y)\,dx$ is the usual marginal density of $Y$. (Just apply
(B.143) with $g(x, y) = h(y)I_C(y)$ and with $g(x, y) = xI_C(y)$, where $A = Y^{-1}(C)$.)

What we have done in this section is give a motivation for the use of the math- 
ematical probability calculus to express uncertainty for the purposes of gambling. 
We assume that an agent chooses which gambles to accept in such a way that he 
or she is not subject to Dutch book, which is a combination of acceptable gam- 
bles that produces a loss no matter what happens. We were also able to use this 
approach to motivate the mathematical definition of conditional expectation by 
introducing conditional gambles and requiring that the same coherence condition 
apply to conditional and unconditional gambles alike. 






B.7 Simulation* 



Several times in this text, we will want to generate observations that have a 
desired distribution. Such observations will be called pseudorandom numbers be- 
cause samples appear to have the properties of random variables, but they are 
actually generated by a complicated deterministic process. We will not go into 
detail on how pseudorandom numbers with uniform $U(0, 1)$ distribution are
generated. In this section, we wish to prove a couple of useful theorems about how to
generate pseudorandom numbers with other distributions under the assumption 
that pseudorandom numbers with U(0, 1) distribution can be generated. 

Theorem B.144. Let $F$ be a CDF and define the inverse of $F$ by

$$F^{-1}(q) = \begin{cases} \inf\{x : F(x) \ge q\} & \text{if } q > 0, \\ \sup\{x : F(x) = 0\} & \text{if } q = 0. \end{cases}$$

If $U$ has $U(0, 1)$ distribution, then $X = F^{-1}(U)$ has CDF $F$.

PROOF. We will calculate $\Pr(X \le t)$ for all $t$. First, let $t$ be a continuity point of
$F$. Then

$$\Pr(X \le t) = \Pr(F^{-1}(U) \le t) = \Pr(U \le F(t)) = F(t),$$

where the second equality follows from the fact that, at a continuity point $t$,
$X \le t$ if and only if $U \le F(t)$, and the third equality follows from the fact
that $U$ has $U(0, 1)$ distribution. Finally, let $t$ be a jump point of $F$ and let
$F(t) - \lim_{x \uparrow t} F(x) = c$. Then $X = t$ if and only if $F(t) - c < U \le F(t)$, so

$$\Pr(X = t) = \Pr(F(t) - c < U \le F(t)) = c.$$

So, $X$ has CDF $F$ at continuity points of $F$ and its distribution has the same
sized jumps as $F$ at the same points. So the CDF of $X$ is $F$. □

This theorem allows us to generate pseudorandom variables with arbitrary
CDF $F$, if we can find $F^{-1}$. The method described in this theorem is called the
probability integral transform. Note that the probability integral transform has a
surprising theoretical implication.

Proposition B.145. Let $U$ have $U(0, 1)$ distribution, and let $X$ be a random
quantity taking values in a Borel space $\mathcal{X}$. Then there exists a measurable function
$f : [0, 1] \to \mathcal{X}$ such that $f(U)$ has the same distribution as $X$.
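
The construction in Theorem B.144 can be sketched in a few lines. The code below is a minimal illustration, not a production generator: the exponential and three-point discrete distributions, and the names `exponential_inverse_cdf`, `discrete_inverse_cdf`, and `sample`, are assumptions made for the example. The discrete case implements the generalized inverse $\inf\{x : F(x) \ge q\}$ by scanning the cumulative probabilities.

```python
import math
import random


def exponential_inverse_cdf(q, rate=1.0):
    """F^{-1}(q) for F(x) = 1 - exp(-rate * x), x >= 0 (requires 0 <= q < 1)."""
    return -math.log(1.0 - q) / rate


def discrete_inverse_cdf(q, values, probs):
    """Generalized inverse inf{x : F(x) >= q} for a law on increasing `values`."""
    cumulative = 0.0
    for value, prob in zip(values, probs):
        cumulative += prob
        if cumulative >= q:
            return value
    return values[-1]  # guards against round-off when q is very close to 1


def sample(inverse_cdf, n, rng):
    """Apply the inverse CDF to n pseudorandom U(0, 1) numbers."""
    return [inverse_cdf(rng.random()) for _ in range(n)]
```

A typical call is `sample(exponential_inverse_cdf, 1000, random.Random(0))`; the value $q = 0$, which the theorem treats separately, occurs with negligible probability and is not special-cased here.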

The next theorem allows us to find pseudorandom variables with arbitrary
density $f$ if we can generate pseudorandom variables with another density $g$ such
that $f(x) \le kg(x)$ for some number $k$ and all $x$.

Theorem B.146 (Acceptance-rejection). Let $f$ be a nonnegative integrable
function, and let $g$ be a density function. Let $k > 0$ and suppose that $f(x) \le kg(x)$
for all $x$. Suppose that $\{Y_i\}_{i=1}^\infty$ and $\{U_i\}_{i=1}^\infty$ are all independent and that the $Y_i$
have density $g$ and the $U_i$ are $U(0, 1)$. Define $Z = Y_N$, where

$$N = \min\left\{i : U_i \le \frac{f(Y_i)}{kg(Y_i)}\right\}.$$



*This section may be skipped without interrupting the flow of ideas. 






Then Z has density proportional to f. 
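Proof. We can write the CDF of $Z$ as

$$\Pr(Z \le t) = \Pr\left(Y_i \le t \,\middle|\, U_i \le \frac{f(Y_i)}{kg(Y_i)}\right) = \frac{E\left[\Pr\left(Y_i \le t, U_i \le \frac{f(Y_i)}{kg(Y_i)} \,\middle|\, Y_i\right)\right]}{E\left[\Pr\left(U_i \le \frac{f(Y_i)}{kg(Y_i)} \,\middle|\, Y_i\right)\right]},$$

where we have used the law of total probability B.70 in the last equation. The
conditional probability in the numerator is

$$\Pr\left(Y_i \le t, U_i \le \frac{f(Y_i)}{kg(Y_i)} \,\middle|\, Y_i\right) = I_{(-\infty,t]}(Y_i)\frac{f(Y_i)}{kg(Y_i)}.$$

The mean of this is

$$\frac{1}{k}\int_{-\infty}^t f(y)\,dy,$$

since $Y_i$ has PDF $g(\cdot)$. Similarly, the denominator conditional probability can be
written as

$$\Pr\left(U_i \le \frac{f(Y_i)}{kg(Y_i)} \,\middle|\, Y_i\right) = \frac{f(Y_i)}{kg(Y_i)}.$$

The mean of this is likewise seen to be $\int f(y)\,dy/k$. The ratio of these is

$$\Pr(Z \le t) = \frac{\int_{-\infty}^t f(y)\,dy}{\int f(y)\,dy};$$

hence $Z$ has density proportional to $f$. □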
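
A minimal sketch of the acceptance-rejection scheme of Theorem B.146 follows. The target $f(x) = x(1-x)^2$ on $[0,1]$ (proportional to a Beta(2, 3) density), the uniform proposal $g$, the constant $k = 0.15$, and all function names are assumptions chosen for illustration; any $k \ge \max f = 4/27$ would satisfy $f \le kg$ here.

```python
import random


def accept_reject(f, g, sample_g, k, rng):
    """Draw Z = Y_N, where N is the first i with U_i <= f(Y_i) / (k g(Y_i))."""
    while True:
        y = sample_g(rng)              # Y_i with density g
        u = rng.random()               # U_i ~ U(0, 1), independent of Y_i
        if u <= f(y) / (k * g(y)):
            return y


def f_unnormalized(x):
    """x (1 - x)^2 on [0, 1]: proportional to the Beta(2, 3) density."""
    return x * (1.0 - x) ** 2 if 0.0 <= x <= 1.0 else 0.0


def g_uniform(x):
    """Density of the U(0, 1) proposal distribution."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def sample_uniform(rng):
    """Draw from the proposal density g_uniform."""
    return rng.random()


K = 0.15   # hypothetical bound: k >= max f = 4/27 (about 0.148)
```

For example, `accept_reject(f_unnormalized, g_uniform, sample_uniform, K, random.Random(1))` returns one draw. Note that $f$ never needs to be normalized; a smaller $k$ simply means fewer rejected proposals.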
Next, we prove a theorem that allows us to simulate from distributions with 
bounded densities and sufficiently thin tails even when we only know the density 
up to a normalizing constant. The theorem is due to Kinderman and Monahan 
(1977). 

Theorem B.147 (Ratio of uniforms method). Let $f : \mathbb{R} \to [0, \infty)$ be an
integrable function. Define

$$A = \left\{(u, v) \in \mathbb{R}^2 : 0 < u \le \sqrt{f\left(\frac{v}{u}\right)}\right\}.$$

If $(U, V)$ has uniform distribution over the set $A$, then $V/U$ has density
proportional to $f$.

PROOF. Let $(U, V)$ be uniformly distributed on the set $A$. Then $f_{U,V}(u, v) =
I_A(u, v)/c$, where $c$ is the area of $A$. Define $X = U$ and $Y = V/U$. The Jacobian
for the transformation is $x$ and the joint density of $(X, Y)$ is

$$f_{X,Y}(x, y) = \frac{x}{c} I_A(x, xy) = \frac{x}{c} I_{[0, \sqrt{f(y)}]}(x).$$

It follows that $f_Y(y) = \int_0^{\sqrt{f(y)}} \frac{x}{c}\,dx = \frac{1}{2c} f(y)$.



If both $f(x) \le b^2$ and $a \le x\sqrt{f(x)} \le c$ for all $x$, then $A$ is contained in the
rectangle with opposite corners $(0, a)$ and $(b, c)$. We can then generate $U \sim U(0, b)$
and $V \sim U(a, c)$. We set $X = V/U$, and if $U^2 \le f(X)$, take $X$ as our desired
random variable. If $U^2 > f(X)$, try again.
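
The recipe just described can be sketched as follows. This is an illustration under stated assumptions: the target is the unnormalized standard normal kernel $f(x) = \exp(-x^2/2)$, for which $\sqrt{f(x)} \le 1$ and $|x\sqrt{f(x)}| \le \sqrt{2/e}$, and the names `ratio_of_uniforms`, `f_normal_kernel`, `A_LOW`, `B`, and `C` are invented for the example.

```python
import math
import random


def ratio_of_uniforms(f, a, b, c, rng):
    """Draw X = V/U with (U, V) uniform on {(u, v) : 0 < u <= sqrt(f(v/u))}.

    (U, V) is proposed uniformly on the rectangle (0, b] x [a, c] and kept
    only if it falls inside the set, i.e. if U^2 <= f(V/U).
    """
    while True:
        u = rng.uniform(0.0, b)
        v = rng.uniform(a, c)
        if u == 0.0:
            continue                  # avoid division by zero
        x = v / u
        if u * u <= f(x):
            return x


def f_normal_kernel(x):
    """exp(-x^2 / 2): the standard normal density without its constant."""
    return math.exp(-0.5 * x * x)


B = 1.0                          # sqrt(f(x)) <= 1 for all x
C = math.sqrt(2.0 / math.e)      # max of x * sqrt(f(x)), attained at x = sqrt(2)
A_LOW = -C                       # min of x * sqrt(f(x)), by symmetry
```

A draw is obtained with `ratio_of_uniforms(f_normal_kernel, A_LOW, B, C, random.Random(2))`. As in acceptance-rejection, only the shape of $f$ is used, never its normalizing constant.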

An important application of simulation is to the numerical integration
technique called importance sampling. Suppose that we wish to know the value of the
ratio of two integrals

$$\frac{\int v(\theta)h(\theta)\,d\theta}{\int h(\theta)\,d\theta}, \qquad \text{(B.148)}$$

where $\theta$ can be a vector. Suppose that $f$ is a density function such that $h/f$ is
nearly constant and it is easy to generate pseudorandom numbers with density
$f$. Let $\{X_i\}_{i=1}^\infty$ be an IID sequence of pseudorandom numbers with density $f$.
Then

$$\int v(\theta)h(\theta)\,d\theta = E\left(v(X_i)\frac{h(X_i)}{f(X_i)}\right), \qquad \int h(\theta)\,d\theta = E\left(\frac{h(X_i)}{f(X_i)}\right),$$

where the expectations are with respect to the pseudo-distribution of $X_i$. If we let
$W_i = h(X_i)/f(X_i)$ and $Z_i = v(X_i)W_i$, then the weak law of large numbers B.95
says that $\bar{Z}_n/\bar{W}_n$ converges in probability to (B.148). 69 The reason that we want
$h/f$ to be nearly constant is so that the variance of $W_i$ is small. In Section 7.1.3,
we will show how to approximate the variance of $\bar{Z}_n/\bar{W}_n$ as an estimate of
(B.148).
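
The estimator $\bar{Z}_n/\bar{W}_n$ can be sketched as below. The example is hypothetical and deliberately simple: $h(\theta) = \theta^2 e^{-\theta}$ on $\theta > 0$ (an unnormalized Gamma(3, 1) density), $v(\theta) = \theta$, and an exponential sampling density $f$ with mean 2, so that the target ratio (B.148) equals 3. Here $h/f$ is bounded rather than nearly constant, which is enough for the law of large numbers to apply. All function names are invented for the illustration.

```python
import math
import random


def importance_ratio(v, h, f, sample_f, n, rng):
    """Estimate (int v h) / (int h) by Zbar_n / Wbar_n with X_i drawn from f."""
    sum_w = 0.0
    sum_z = 0.0
    for _ in range(n):
        x = sample_f(rng)
        w = h(x) / f(x)               # W_i = h(X_i) / f(X_i)
        sum_w += w
        sum_z += v(x) * w             # Z_i = v(X_i) W_i
    return sum_z / sum_w              # the common factor 1/n cancels


def h_kernel(theta):
    """theta^2 exp(-theta): proportional to the Gamma(3, 1) density."""
    return theta * theta * math.exp(-theta) if theta > 0.0 else 0.0


def f_density(theta):
    """Exponential density with rate 1/2 (mean 2), the sampling density."""
    return 0.5 * math.exp(-0.5 * theta) if theta >= 0.0 else 0.0


def sample_f(rng):
    """Draw from f_density."""
    return rng.expovariate(0.5)


def v_identity(theta):
    """v(theta) = theta, so the target ratio is the Gamma(3, 1) mean, 3."""
    return theta
```

Calling `importance_ratio(v_identity, h_kernel, f_density, sample_f, 50000, random.Random(3))` gives an estimate close to 3. Because the same weights appear in numerator and denominator, neither $h$ nor the ratio $h/f$ has to be normalized.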



B.8 Problems 

Section B.2: 

1. Suppose that an urn contains m > 3 white balls and n > 3 black balls. 
Suppose that the urn is well mixed so that at any time, the probability 
that any one of the remaining balls in the urn is as likely to be drawn as 
any other. We will draw three balls without replacement and set Xi = 1 if 
the ith ball drawn is black, Xi = 0 if the ith ball is white. Show that 

$$\Pr(X_1 = 1, X_2 = 0, X_3 = 1) = \Pr(X_1 = 0, X_2 = 1, X_3 = 1) = \Pr(X_1 = 1, X_2 = 1, X_3 = 0).$$

2. Suppose that $H$ is a nondecreasing function and

$$F(x) = \inf_{t > x,\ t \text{ rational}} H(t).$$



69 The strong law of large numbers 1.63 says that Z n /W n converges a.s. to 
(B.148). 






(a) Prove that F is continuous from the right. 

(b) Prove that $\inf_{\text{all } x} H(x) = \inf_{\text{all } x} F(x)$.

(c) Prove that $\sup_{\text{all } x} H(x) = \sup_{\text{all } x} F(x)$.



Section B.3: 



3. Using the definition of conditional probability, show that $A \cap B = \emptyset$ implies

$$\Pr(A|\mathcal{C}) + \Pr(B|\mathcal{C}) = \Pr(A \cup B|\mathcal{C}), \quad \text{a.s.}$$

Use this to help prove that $\{A_n\}_{n=1}^\infty$ disjoint implies

$$\Pr\left(\bigcup_{n=1}^\infty A_n \,\middle|\, \mathcal{C}\right) = \sum_{n=1}^\infty \Pr(A_n|\mathcal{C}), \quad \text{a.s.}$$

4.*Let $X_1$ and $X_2$ be IID random variables with $U(0, 1)$ distribution. Let

$$T = \max\{X_1, X_2\}.$$

Using the definition of conditional distribution, show that the conditional
distribution of $X_1$ given $T = t$ is a mixture of a point mass at $t$ and a
$U(0, t)$ distribution. Also, find the mixture.

5. Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $\mathcal{C}$ be a sub-$\sigma$-field such that $\mu(C) \in
\{0, 1\}$ for all $C \in \mathcal{C}$. Let $E|X| < \infty$. Prove that $E(X|\mathcal{C}) = E(X)$, a.s. $[\mu]$.

6. Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $\{A_n\}_{n=1}^\infty$ be a partition of $S$,
and let $\mathcal{C}$ be the smallest $\sigma$-field containing $\{A_n\}_{n=1}^\infty$. Let $X$ be a random
variable. Show that $E(X|\mathcal{C}) = \sum_{n=1}^\infty I_{A_n} w_n$, where



7. Let $\Phi$ denote the standard normal CDF, and let the joint CDF of random
variables $(X, Y)$ be



(a) Find the conditional distribution of $X$ given $Y$.

(b) Find the conditional distribution of $Y$ given $X$.

8. Prove Proposition B.25 on page 617. (Hint: Use part 4 of Proposition A.49.)

9. Prove Proposition B.26 on page 617.

10. Prove Proposition B.27 on page 617. (Hint: Prove it for $g$ an indicator
function, then for simple functions, then for nonnegative measurable functions,
then for all integrable functions.)

11. Prove Proposition B.28 on page 617.






12. Suppose that $X_1, \ldots, X_n$ are independent, each with distribution $N(c, 1)$.
Find the conditional distribution of $X_1, \ldots, X_n$ given $\bar{X}_n = x$, where
$\bar{X}_n = \sum_{i=1}^n X_i/n$.

13. Let $\mathcal{B}_1 \subseteq \mathcal{B}_2 \subseteq \cdots$ be a sequence of $\sigma$-fields, and let $X \ge 0$. Suppose that
$E(X|\mathcal{B}_n) = Y$ for all $n$. Let $\mathcal{B}$ be the smallest $\sigma$-field containing all of the
$\mathcal{B}_n$. Show that $E(X|\mathcal{B}) = Y$, a.s. (Hint: Show that the union of the $\mathcal{B}_n$ is a
$\pi$-system, and use Theorem A.26.)

14. Prove Proposition B.43 on page 623. 

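15. Assume the conditions of Theorem B.46. Also, suppose that $(\mathcal{X}, \mathcal{B}_1, \nu_1)$
and $(\mathcal{Y}, \mathcal{B}_2, \nu_2)$ are $\sigma$-finite measure spaces and $\nu = \nu_1 \times \nu_2$. Prove that $\nu_1$
can play the role of $\nu_{\mathcal{X}|\mathcal{Y}}(\cdot|y)$ for all $y$ and that $\nu_2$ can play the role of $\nu_{\mathcal{Y}}$
in the statement of Theorem B.46.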

16. Prove Proposition B.51 on page 625. (Hint: Notice that lA(v~ 1 (y i w)) = 

17. Prove Proposition B.66 on page 631. (Hint: Prove the result for product
sets first, and then use Theorem A.26.)

18. Prove Corollary B.67 on page 631. 

19. Prove Corollary B.74 on page 633. 

20. Prove the second Borel-Cantelli lemma: If $\{A_n\}_{n=1}^\infty$ are mutually
independent and $\sum_{n=1}^\infty \Pr(A_n) = \infty$, then $\Pr(\cap_{i=1}^\infty \cup_{n=i}^\infty A_n) = 1$. (This set
is sometimes called $A_n$ infinitely often.) (Hint: Find the probability of the
complement by using the fact that $1 - x \le \exp(-x)$ for $0 \le x \le 1$.)

21. *Suppose that $(S, \mathcal{A}, \mu)$ is a measure space. Let $\{f_n\}_{n=1}^\infty$ be a sequence of
measurable functions $f_n : S \to T$, where $(T, \mathcal{B})$ is a metric space with Borel
$\sigma$-field. Let $\mathcal{C}$ be the tail $\sigma$-field of $\{f_n\}_{n=1}^\infty$. If $\lim_{n \to \infty} f_n(s) = f(s)$, for all
$s$, then prove that $f$ is measurable with respect to $\mathcal{C}$. (Hint: Refer to the
proof of part 5 of Theorem A.38. Show that the set $A^* \in \mathcal{C}$ by showing
that the union in (A.39) does not need to start at 1.)

22. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let $\mathcal{C}$ be the tail $\sigma$-field of a
sequence of random quantities $\{X_n\}_{n=1}^\infty$, where $X_n : S \to \mathcal{X}$ for all $n$. Let
$\mathcal{V}$ be the $\sigma$-field generated by $\{X_n\}_{n=1}^\infty$. Let $X = (X_1, X_2, \ldots) \in \mathcal{X}^\infty$.
If $\pi$ is a permutation of a finite set of integers $\{1, \ldots, n\}$, let $\pi X =
(X_{\pi(1)}, \ldots, X_{\pi(n)}, X_{n+1}, \ldots)$. We say that $A \in \mathcal{V}$ is symmetric if $A =
X^{-1}(B)$ and for every permutation $\pi$ of finitely many coordinates, $A =
(\pi X)^{-1}(B)$ as well.

(a) Prove that every $C \in \mathcal{C}$ is symmetric.

(b) Show that there can be symmetric events that are not in C. 

23. Prove Proposition B.78 on page 634. 






Section B.4:

24. Find a sequence of random variables that converges in probability to 0
but does not converge a.s. to 0. (Hint: Consider the countable collection
of all subsets of $[0, 1]$ of the form $[k/2^n, (k + 1)/2^n]$ with $k$ and $n$ integers.
Arrange them in an appropriate sequence.)

25. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables, and let $X$ be another
random variable. Let $F_n$ be the CDF of $X_n$ and let $F$ be the CDF of $X$.
Prove that $X_n \overset{\mathcal{D}}{\to} X$ if and only if $\lim_{n \to \infty} F_n(x) = F(x)$ for every $x$ such
that $F$ is continuous at $x$.

26. Prove that $|\exp(iy) - 1| \le \min\{|y|, 2\}$ for all $y$. (Hint: Show that $\exp(iy) =
1 + i\int_0^y \exp(is)\,ds$ for $y \ge 0$ and a similar formula for $y < 0$.)

27. Prove the weak law of large numbers for infinite means: Suppose that
$\{X_i\}_{i=1}^\infty$ are IID with mean $\infty$. Then, for all real $x$, $\lim_{n \to \infty} \Pr(\bar{X}_n >
x) = 1$, where $\bar{X}_n = \sum_{i=1}^n X_i/n$. (Hint: Define $Y_{i,t} = \min\{X_i, t\}$. Prove
that $E(Y_{i,t}) < \infty$ for all $t$, but $\lim_{t \to \infty} E(Y_{i,t}) = \infty$.)

28. *Suppose that $X$ is a random vector having bounded density with respect to
Lebesgue measure. Prove that the characteristic function of $X$ is integrable.
(Hint: Run the proof of Lemma B.101 in reverse.)

Section B.5: 

29. Let $\{i_n\}_{n=1}^\infty$ be a sequence of numbers in $\{0, 1\}$. Suppose that $\{X_n\}_{n=1}^\infty$ is
a sequence of Bernoulli random variables such that

Pr(Xi =i u ...,X n = i n ) = ^2^4j, 

where x =

Show that this specifies a consistent set of joint distributions for $n = 1, 2, \ldots$.

30. Let $\mu$ be a finite measure on $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-field. Suppose
that $\{X(t) : -\infty < t < \infty\}$ is a stochastic process such that $X(t)$ has
$Beta(\mu(-\infty, t], \mu(t, \infty))$ distribution for each $t$, $X(t) \ge X(s)$ if $t > s$, and
$X(\cdot)$ is continuous from the right.

(a) Prove that $\Pr(\lim_{t \to \infty} X(t) = 1) = 1$.

(b) Let $U = \inf\{t : X(t) \ge 1/2\}$. Prove that the median of $U$ is $\inf\{t :
\mu(-\infty, t] \ge \mu(t, \infty)\}$. (Hint: Write $\{U \le s\}$ in terms of $X(\cdot)$.)

31. Let $R$ be a set, and let $(\mathcal{X}_r, \mathcal{B}_r)$ be a Borel space for every $r \in R$. Let $\mathcal{X} =
\prod_{r \in R} \mathcal{X}_r$ and let $\mathcal{B}$ be the product $\sigma$-field. For each $r \in R$, let $X_r : \mathcal{X} \to \mathcal{X}_r$
be the projection function $X_r(x) = x_r$. Prove that $\mathcal{B}$ is the union of all of
the $\sigma$-fields generated by all of the countable collections of $X_r$ functions.
That is, let $Q$ be the set of all countable subsets of $R$, and for each $q \in Q$
let $X_q = \{X_r\}_{r \in q}$ and let $\mathcal{B}_q$ be the $\sigma$-field generated by $X_q$. Then show
that $\mathcal{B} = \cup_{q \in Q} \mathcal{B}_q$.

Section B.7:

32. Prove Proposition B.145 on page 659. 



Appendix C 

Mathematical Theorems Not 
Proven Here 



There are several theorems of a purely mathematical nature which we use on 
occasion in this text, but which we do not wish to prove here because their proofs 
involve a great deal of mathematical background of which we will not make use 
anywhere else. 



C.1 Real Analysis

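Theorem C.1 (Taylor's theorem). 1 Suppose that $f : \mathbb{R}^m \to \mathbb{R}$ has
continuous partial derivatives of all orders up to and including $k + 1$ with respect
to all coordinates in a convex neighborhood $D$ of a point $x_0$. For $x \in D$ and
$i = 1, \ldots, k + 1$, define

$$D^{(i)}(f; x, y) = \sum_{j_1=1}^m \cdots \sum_{j_i=1}^m \left.\frac{\partial^i f(z)}{\partial z_{j_1} \cdots \partial z_{j_i}}\right|_{z=x} \prod_{s=1}^i y_{j_s},$$

where we allow notation like $\partial^3/\partial z_1 \partial z_1 \partial z_4$ to stand for $\partial^3/\partial z_1^2 \partial z_4$. Then, for
$x \in D$,

$$f(x) = f(x_0) + \sum_{i=1}^k \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + \frac{1}{(k+1)!} D^{(k+1)}(f; x^*, x - x_0),$$

where $x^*$ is on the line segment joining $x$ and $x_0$.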



1 This theorem is used in the proofs of Theorems 7.63, 7.89, 7.108, and 7.125.
For a proof (with $m = 2$), see Buck (1965), Theorem 16 on page 260.






Theorem C.2 (Inverse function theorem). 2 Let $f$ be a continuously
differentiable function from an open set in $\mathbb{R}^n$ into $\mathbb{R}^n$ such that $((\partial f_i/\partial x_j))$ is a
nonsingular matrix at a point $x$. If $y = f(x)$, then there exist open sets $U$ and $V$
such that $x \in U$, $y \in V$, $f$ is one-to-one on $U$, and $f(U) = V$. Also, if $g : V \to U$
is the inverse of $f$ on $U$, then $g$ is continuously differentiable on $V$.

Theorem C.3 (Stone-Weierstrass theorem). 3 Let $\mathcal{A}$ be a collection of
continuous complex functions defined on a compact set $C$ and satisfying these
conditions:

• If $f \in \mathcal{A}$, then the complex conjugate of $f$ is in $\mathcal{A}$.

• If $x_1 \ne x_2 \in C$, then there exists $f \in \mathcal{A}$ such that $f(x_1) \ne f(x_2)$.

• If $f, g \in \mathcal{A}$, then $f + g \in \mathcal{A}$ and $fg \in \mathcal{A}$.

• If $f \in \mathcal{A}$ and $c$ is a constant, then $cf \in \mathcal{A}$.

• For each $x \in C$, there exists $f \in \mathcal{A}$ such that $f(x) \ne 0$.

Then, for every continuous complex function $f$ on $C$, there exists a sequence
$\{f_n\}_{n=1}^\infty$ in $\mathcal{A}$ such that $f_n$ converges uniformly to $f$ on $C$.

Theorem C.4 (Supporting hyperplane theorem). 4 If $S$ is a convex subset
of a finite-dimensional Euclidean space, and $x_0$ is a boundary point of $S$, then
there is a nonzero vector $v$ such that for every $x \in S$, $v^\top x \ge v^\top x_0$.

Theorem C.5 (Separating hyperplane theorem). 5 If $S_1$ and $S_2$ are disjoint
convex subsets of a finite-dimensional Euclidean space, then there is a nonzero
vector $v$ and a constant $c$ such that for every $x \in S_1$, $v^\top x \le c$ and for every
$y \in S_2$, $v^\top y \ge c$.

Theorem C.6 (Bolzano-Weierstrass theorem). 6 Suppose that B is a closed 
and bounded subset of a finite- dimensional Euclidean space. Then every infinite 
subset of B has a cluster point in B. 



C.2 Complex Analysis 

Theorem C.7.⁷ Let $f$ be an analytic function in a neighborhood of a point $z$.
Then the derivatives of $f$ of every order exist and are analytic in a neighborhood



² This theorem is used in the proof of Theorem 7.57. For a proof, see Rudin
(1964), Theorem 9.17.

³ This theorem is used in the proofs of DeFinetti's representation theorem 1.49
and 1.47 and Theorem B.93. For a proof, see Rudin (1964), Theorem 7.31.

⁴ This theorem is used in the proof of Theorem B.17. For a proof, see Berger
(1985), Theorem 12 on page 341, or Ferguson (1967), Theorem 1 on page 73.

⁵ This theorem is used in the proofs of Theorems B.17, 3.77 and 3.95. For a
proof, see Berger (1985), Theorem 13 on page 342, or Ferguson (1967), Theorem 2
on page 73.

⁶ This theorem is used in the proof of Theorem 3.77. For a proof, see Dugundji
(1966), Theorems 3.2 and 4.3 of Chapter XI.

⁷ This theorem is used to show that certain estimators are UMVUE, and in
the proof of Theorem 2.74. For a proof, see Churchill (1960, Sections 52 and 56).






of $z$. If $f^{(k)}$ denotes the $k$th derivative of $f$, then

$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(z)}{k!} (x - z)^k$$

for all $x$ in some circle around $z$.

Theorem C.8 (Maximum modulus theorem).⁸ Let $f$ be an analytic function
in an open set $D$ which is continuous on the closure of $D$. Let the maximum
value of $|f(z)|$ for $z$ in the closure of $D$ be $c$. Then $|f(z)| < c$ for all $z \in D$ unless
$f$ is constant on $D$.

Theorem C.9 (Cauchy's equation).⁹ Let $G$ be a Borel subset of $\mathbb{R}^k$ with
positive Lebesgue measure. Let $f : G \to \mathbb{R}$ be measurable. Let $H_1 = G$ and
$H_n = H_{n-1} + G$ for each $n$. For each $n$, let $g_n : H_n \to \mathbb{R}$ be measurable such
that $g_n(\sum_{i=1}^{n} x_i) = \sum_{i=1}^{n} f(x_i)$, for almost all $(x_1, \ldots, x_n) \in G^n$. Then there is
a real number $a$ and a vector $b \in \mathbb{R}^k$ such that $f(x) = a + b^T x$ a.e. in $G$.

C.3 Functional Analysis 

Theorem C.10.¹⁰ If $T$ is an operator with finite norm on the Hilbert space $L^2(\mu)$
given by $T(f)(x) = \int K(x'|x) f(x')\,d\mu(x')$, then $T$ is of Hilbert-Schmidt type if and
only if

$$\int\!\!\int |K(x'|x)|^2\,d\mu(x')\,d\mu(x) < \infty.$$

Theorem C.11.¹¹ Every operator of Hilbert-Schmidt type is completely
continuous.

Theorem C.12.¹² If $T$ is a completely continuous self-adjoint operator, then $T$
has an eigenvalue $\lambda$ with $|\lambda| = \|T\|$.

Theorem C.13.¹³ If $T$ is a linear operator with finite norm and $T^*$ is its adjoint
operator, then $\|T^*T\| = \|T\|^2$.



⁸ This theorem is used in the proof of Theorem 2.64. For a proof, see Churchill
(1960), Section 54, or Ahlfors (1966), Theorem 12' on page 134.

⁹ This theorem is used in the proof of Theorem 2.114. For a proof, see Diaconis
and Freedman (1990), Theorem 2.1.

¹⁰ This theorem is used in the proof of Theorem 8.40. For a proof, see
Section XI.6 of Dunford and Schwartz (1963). By $L^2(\mu)$ we mean
$\{f : \int f^2(x)\,d\mu(x) < \infty\}$.

¹¹ This theorem is used in the proof of Theorem 8.40. For a proof, see
Theorem 6 of Section XI.6 of Dunford and Schwartz (1963). The reader should note
that Dunford and Schwartz (1963) use the term compact instead of completely
continuous.

¹² This theorem is used in the proof of Theorem 8.40. For a proof, see Lemma 1
in Section VIII.3 of Berberian (1961).

¹³ This theorem is used in the proof of Theorem 8.40. For a proof, see part (5)
of Theorem 2 on p. 132 of Berberian (1961).





Appendix D 

Summary of Distributions 



The distributions used in this book are listed here. We give the name and sym- 
bol used to describe each distribution. Each distribution is absolutely continuous 
with respect to some measure or other. In most cases the mean and variance are 
given. In some cases, the symbol for the CDF is given. 

D.1 Univariate Continuous Distributions
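Alternate noncentral beta

Symbol: $ANCB(q, a, \gamma)$

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{\Gamma(\frac{q+a}{2}+2k)}{k!\,\Gamma(\frac{q}{2}+k)\,\Gamma(\frac{a}{2})} \gamma^k (1-\gamma)^{a/2} x^{\frac{q}{2}+k-1} (1-x)^{\frac{a}{2}+k-1}$
Dominating measure: Lebesgue measure on [0, 1]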

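Alternate noncentral chi-squared

Symbol: $ANC\chi^2(q, a, \gamma)$ ¹

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{\Gamma(\frac{a}{2}+k)}{k!\,\Gamma(\frac{a}{2})} \gamma^k (1-\gamma)^{a/2} \frac{x^{\frac{q}{2}+k-1}}{2^{\frac{q}{2}+k}\,\Gamma(\frac{q}{2}+k)} \exp(-\frac{x}{2})$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: $q + a\frac{\gamma}{1-\gamma}$
Variance: $2\left[q + a\frac{\gamma(2-\gamma)}{(1-\gamma)^2}\right]$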
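
The mean and variance of the alternate noncentral chi-squared distribution can be checked numerically under one reading of the density: a mixture of $\chi^2_{q+2k}$ distributions whose weights are negative binomial in $k$. That reading is an assumption made for this sketch, and the function name `anc_chi2_moments` and the truncation point `terms` are hypothetical.

```python
import math

def anc_chi2_moments(q, a, gamma, terms=400):
    """Mean and variance of a negative binomial mixture of chi-squared(q + 2k).

    Assumed weights (illustrative reading of the density):
        w_k = Gamma(a/2 + k) / (k! Gamma(a/2)) * gamma**k * (1 - gamma)**(a/2)
    """
    w = (1.0 - gamma) ** (a / 2.0)  # weight for k = 0
    mean = 0.0
    second = 0.0
    for k in range(terms):
        df = q + 2 * k
        mean += w * df                      # E[chi2_df] = df
        second += w * (2 * df + df * df)    # E[chi2_df^2] = variance + mean^2
        w *= gamma * (a / 2.0 + k) / (k + 1)  # recurrence to the weight for k + 1
    return mean, second - mean * mean

q, a, gamma = 4, 6, 0.3
mean, var = anc_chi2_moments(q, a, gamma)
# Compare the truncated series with the closed forms q + a*gamma/(1-gamma)
# and 2*[q + a*gamma*(2-gamma)/(1-gamma)**2].
print(mean, q + a * gamma / (1 - gamma))
print(var, 2 * (q + a * gamma * (2 - gamma) / (1 - gamma) ** 2))
```

The series is truncated rather than summed exactly, so agreement is only up to floating-point and truncation error; for $\gamma$ close to 1 a larger `terms` would be needed.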

¹ This distribution was derived without a name by Geisser (1967). It was named
$L^2$ by Lecoutre and Rouanet (1981).






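Alternate noncentral F

Symbol: $ANCF(q, a, \gamma)$ ²

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{\Gamma(\frac{q+a}{2}+2k)}{k!\,\Gamma(\frac{q}{2}+k)\,\Gamma(\frac{a}{2})} \gamma^k (1-\gamma)^{a/2} \frac{q^{\frac{q}{2}+k} a^{\frac{a}{2}+k} x^{\frac{q}{2}+k-1}}{(a+qx)^{\frac{q+a}{2}+2k}}$

Dominating measure: Lebesgue measure on [0, ∞)

Mean: $(1-\gamma)\frac{a}{a-2} + \gamma\frac{a}{q}$, if $a > 2$

Variance: $\frac{2a^2(a+q-2)(1-\gamma)^2}{(a-2)^2(a-4)q} + \frac{4a^2\gamma(1-\gamma)}{(a-2)q^2}$, if $a > 4$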

Beta 

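Symbol: $Beta(\alpha, \beta)$

Density: $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$

Dominating measure: Lebesgue measure on [0, 1]
Mean: $\frac{\alpha}{\alpha+\beta}$

Variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$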

Cauchy 

Symbol: $Cau(\mu, \sigma^2)$
Density: $f_X(x) = \pi^{-1}\sigma^{-1}\left(1 + \frac{(x-\mu)^2}{\sigma^2}\right)^{-1}$

Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: Does not exist 
Variance: Does not exist 

Chi-squared 

Symbol: $\chi^2_a$

Density: $f_X(x) = \frac{x^{\frac{a}{2}-1}}{2^{a/2}\,\Gamma(\frac{a}{2})} \exp(-\frac{x}{2})$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: a 
Variance: 2a 

² The alternate noncentral F distribution, with a different scaling factor, was
called the $\psi^2$ distribution by Rouanet and Lecoutre (1983). See also Lecoutre
(1985). The distribution was derived without a name by Ferrandiz (1985).
Schervish (1992) gives additional details concerning the $ANC\chi^2$, $ANCB$, and
$ANCF$ distributions.






Exponential 

Symbol: $Exp(\theta)$

Density: $f_X(x) = \theta\exp(-x\theta)$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: $\frac{1}{\theta}$
Variance: $\frac{1}{\theta^2}$

F 

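Symbol: $F_{q,a}$

Density: $f_X(x) = \frac{\Gamma(\frac{q+a}{2})\, q^{q/2} a^{a/2}}{\Gamma(\frac{q}{2})\Gamma(\frac{a}{2})} x^{\frac{q}{2}-1} (a+qx)^{-\frac{q+a}{2}}$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: $\frac{a}{a-2}$, if $a > 2$

Variance: $\frac{2a^2(q+a-2)}{q(a-2)^2(a-4)}$, if $a > 4$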

Gamma 
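Symbol: $\Gamma(\alpha, \beta)$

Density: $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} \exp(-\beta x)$
Dominating measure: Lebesgue measure on [0, ∞)
Mean: $\frac{\alpha}{\beta}$
Variance: $\frac{\alpha}{\beta^2}$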

Inverse gamma 
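Symbol: $\Gamma^{-1}(\alpha, \beta)$

Density: $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha-1} \exp(-\frac{\beta}{x})$
Dominating measure: Lebesgue measure on [0, ∞)
Mean: $\frac{\beta}{\alpha-1}$, if $\alpha > 1$
Variance: $\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$, if $\alpha > 2$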

Laplace 

Symbol: $Lap(\mu, \sigma)$

Density: $f_X(x) = \frac{1}{2\sigma} \exp\left(-\frac{|x-\mu|}{\sigma}\right)$

Dominating measure: Lebesgue measure on $\mathbb{R}$

Mean: $\mu$
Variance: $2\sigma^2$






Noncentral beta 

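Symbol: $NCB(\alpha, \beta, \psi)$

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{(\frac{\psi}{2})^k \exp(-\frac{\psi}{2})}{k!} \frac{\Gamma(\alpha+\beta+k)}{\Gamma(\alpha+k)\Gamma(\beta)} x^{\alpha+k-1}(1-x)^{\beta-1}$
Dominating measure: Lebesgue measure on [0, 1]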

Noncentral chi-squared 

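Symbol: $NC\chi^2_q(\psi)$

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{(\frac{\psi}{2})^k \exp(-\frac{\psi}{2})}{k!} \frac{x^{\frac{q}{2}+k-1}}{2^{\frac{q}{2}+k}\,\Gamma(\frac{q}{2}+k)} \exp(-\frac{x}{2})$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: $q + \psi$
Variance: $2q + 4\psi$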
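
As a rough numerical illustration, the series form of the noncentral chi-squared density (a Poisson mixture of central chi-squared densities) can be evaluated term by term and integrated to confirm that it has total mass 1 and mean $q + \psi$. The sketch below assumes that standard series form; the names `ncx2_pdf` and `integrate`, the truncation at `terms`, and the integration limits are choices made only for this example.

```python
import math

def ncx2_pdf(x, q, psi, terms=60):
    """Series density of NCchi2_q(psi): Poisson(psi/2) weights on chi2_{q+2k} densities."""
    if x <= 0.0 or psi <= 0.0:
        return 0.0
    total = 0.0
    for k in range(terms):
        # log of the Poisson weight for k
        log_w = k * math.log(psi / 2.0) - psi / 2.0 - math.lgamma(k + 1)
        # log of the chi-squared density with q + 2k degrees of freedom
        half_df = q / 2.0 + k
        log_f = (half_df - 1.0) * math.log(x) - x / 2.0 \
            - half_df * math.log(2.0) - math.lgamma(half_df)
        total += math.exp(log_w + log_f)
    return total

def integrate(g, lo, hi, steps):
    """Midpoint rule, adequate for a smooth integrand on a bounded interval."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

q, psi = 4, 2.5
mass = integrate(lambda x: ncx2_pdf(x, q, psi), 0.0, 60.0, 3000)
mean = integrate(lambda x: x * ncx2_pdf(x, q, psi), 0.0, 60.0, 3000)
print(mass, mean, q + psi)
```

Working on the log scale with `math.lgamma` avoids overflow in the Gamma factors for large $k$; the upper limit 60 is only suitable for small $q$ and $\psi$ such as those used here.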

Noncentral F 

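Symbol: $NCF(q, a, \psi)$

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{(\frac{\psi}{2})^k \exp(-\frac{\psi}{2})}{k!} \frac{\Gamma(\frac{q+a}{2}+k)}{\Gamma(\frac{q}{2}+k)\Gamma(\frac{a}{2})} \frac{q^{\frac{q}{2}+k} a^{\frac{a}{2}} x^{\frac{q}{2}+k-1}}{(a+qx)^{\frac{q+a}{2}+k}}$

Dominating measure: Lebesgue measure on [0, ∞)
Mean: $\left(1 + \frac{\psi}{q}\right)\frac{a}{a-2}$, if $a > 2$

Variance: $2\left(\frac{a}{q}\right)^2 \frac{(q+\psi)^2 + (q+2\psi)(a-2)}{(a-2)^2(a-4)}$, if $a > 4$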



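Noncentral t

Symbol: $NCt_a(\delta)$

Density: $f_X(x) = \sum_{k=0}^{\infty} \frac{\Gamma(\frac{a+k+1}{2})}{k!\,\Gamma(\frac{a}{2})\sqrt{a\pi}} \exp(-\frac{\delta^2}{2}) \left(\delta x \sqrt{\frac{2}{a}}\right)^k \left(1 + \frac{x^2}{a}\right)^{-\frac{a+k+1}{2}}$

Dominating measure: Lebesgue measure on $\mathbb{R}$
Mean: $\delta \frac{\Gamma(\frac{a-1}{2})}{\Gamma(\frac{a}{2})} \sqrt{\frac{a}{2}}$, if $a > 1$

Variance: $\frac{a(1+\delta^2)}{a-2} - \frac{a\delta^2}{2}\left[\frac{\Gamma(\frac{a-1}{2})}{\Gamma(\frac{a}{2})}\right]^2$, if $a > 2$
CDF: $NCT_a(\cdot|\delta)$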

Normal 

Symbol: $N(\mu, \sigma^2)$

Density: $f_X(x) = (\sqrt{2\pi}\,\sigma)^{-1} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Dominating measure: Lebesgue measure on (−∞, ∞)

Mean: $\mu$
Variance: $\sigma^2$

CDF: $\Phi(\cdot)$ (For $N(0, 1)$ distribution)

Pareto 

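Symbol: $Par(\alpha, c)$

Density: $f_X(x) = \frac{\alpha c^\alpha}{x^{\alpha+1}}$

Dominating measure: Lebesgue measure on [c, ∞)
Mean: $\frac{\alpha c}{\alpha-1}$, if $\alpha > 1$

Variance: $\frac{\alpha c^2}{(\alpha-2)(\alpha-1)^2}$, if $\alpha > 2$

t

Symbol: $t_a(\mu, \sigma^2)$

Density: $f_X(x) = \frac{\Gamma(\frac{a+1}{2})}{\Gamma(\frac{a}{2})\sqrt{a\pi}\,\sigma} \left(1 + \frac{(x-\mu)^2}{a\sigma^2}\right)^{-\frac{a+1}{2}}$

Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: $\mu$, if $a > 1$
Variance: $\sigma^2\frac{a}{a-2}$, if $a > 2$

CDF: $T_a(\cdot)$ (For $t_a(0, 1)$ distribution)

Uniform

Symbol: $U(a, b)$
Density: $f_X(x) = (b-a)^{-1}$

Dominating measure: Lebesgue measure on [a, b]

Mean: $\frac{a+b}{2}$

Variance: $\frac{(b-a)^2}{12}$

D.2 Univariate Discrete Distributions

Bernoulli

Symbol: $Ber(p)$

Density: $f_X(x) = p^x(1-p)^{1-x}$

Dominating measure: Counting measure on {0, 1}

Mean: $p$

Variance: $p(1-p)$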







Binomial 

Symbol: $Bin(n, p)$

Density: $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$

Dominating measure: Counting measure on {0, ..., n}

Mean: $np$

Variance: $np(1-p)$

Geometric 
Symbol: $Geo(p)$
Density: $f_X(x) = p(1-p)^x$

Dominating measure: Counting measure on {0, 1, 2, ...}

Mean: $\frac{1-p}{p}$

Variance: $\frac{1-p}{p^2}$

Hypergeometric 

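Symbol: $Hyp(N, n, k)$

Density: $f_X(x) = \frac{\binom{n}{x}\binom{N-n}{k-x}}{\binom{N}{k}}$

Dominating measure: Counting measure on
{max{0, n − N + k}, ..., min{n, k}}

Mean: $\frac{kn}{N}$

Variance: $\left(\frac{kn}{N}\right)\left(\frac{(N-k)(N-n)}{N(N-1)}\right)$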
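
The hypergeometric entry can be checked with exact rational arithmetic. The sketch below assumes the usual density $\binom{n}{x}\binom{N-n}{k-x}/\binom{N}{k}$ and the standard closed forms $kn/N$ and $kn(N-n)(N-k)/(N^2(N-1))$ for the mean and variance; the function names `hyp_pmf` and `hyp_summary` are hypothetical.

```python
from fractions import Fraction
from math import comb

def hyp_pmf(x, N, n, k):
    """P(X = x) for Hyp(N, n, k), as an exact fraction."""
    return Fraction(comb(n, x) * comb(N - n, k - x), comb(N, k))

def hyp_summary(N, n, k):
    """Return (total probability, mean, variance) over the stated support."""
    support = range(max(0, n - N + k), min(n, k) + 1)
    probs = {x: hyp_pmf(x, N, n, k) for x in support}
    total = sum(probs.values())
    mean = sum(x * p for x, p in probs.items())
    var = sum(x * x * p for x, p in probs.items()) - mean ** 2
    return total, mean, var

N, n, k = 20, 7, 5
total, mean, var = hyp_summary(N, n, k)
# Compare with the closed forms for the mean and variance.
print(total, mean, Fraction(k * n, N))
print(var, Fraction(k * n * (N - n) * (N - k), N * N * (N - 1)))
```

Using `Fraction` keeps every quantity exact, so the comparison with the closed forms is an equality test rather than a tolerance check.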

Negative binomial 

Symbol: $Negbin(\alpha, p)$

Density: $f_X(x) = \frac{\Gamma(\alpha+x)}{\Gamma(\alpha)\,x!} p^\alpha (1-p)^x$

Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: $\alpha\frac{1-p}{p}$
Variance: $\alpha\frac{1-p}{p^2}$

Poisson 

Symbol: $Poi(\lambda)$

Density: $f_X(x) = \exp(-\lambda)\frac{\lambda^x}{x!}$

Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: $\lambda$
Variance: $\lambda$






D.3 Multivariate Distributions 
Dirichlet 

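Symbol: $Dir_k(\alpha_1, \ldots, \alpha_k)$

Density: $f_{X_1,\ldots,X_{k-1}}(x_1, \ldots, x_{k-1}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} x_1^{\alpha_1-1} \cdots x_{k-1}^{\alpha_{k-1}-1} (1 - x_1 - \cdots - x_{k-1})^{\alpha_k-1}$, where $\alpha_0 = \sum_{i=1}^{k} \alpha_i$

Dominating measure: Lebesgue measure on

$\{(x_1, \ldots, x_{k-1}) : \text{all } x_i \geq 0 \text{ and } x_1 + \cdots + x_{k-1} \leq 1\}$

Mean: $E(X_i) = \frac{\alpha_i}{\alpha_0}$

Variance: $Var(X_i) = \frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)}$

Covariance: $Cov(X_i, X_j) = -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}$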
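
A quick simulation gives a sanity check on the Dirichlet moments. The sketch below uses the well-known construction of a Dirichlet vector as independent Gamma variables divided by their sum, which is not stated in this appendix and is brought in here only for illustration; the function name `sample_dirichlet`, the seed, and the sample size are arbitrary choices.

```python
import random

def sample_dirichlet(alphas, rng):
    """One draw of (X_1, ..., X_k) via normalized independent Gamma(alpha_i, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]

alphas = [2.0, 3.0, 5.0]
a0 = sum(alphas)
rng = random.Random(0)          # fixed seed so the run is reproducible
draws = [sample_dirichlet(alphas, rng) for _ in range(20000)]

m1 = sum(d[0] for d in draws) / len(draws)
m2 = sum(d[1] for d in draws) / len(draws)
cov12 = sum((d[0] - m1) * (d[1] - m2) for d in draws) / len(draws)

# Sample moments next to the closed forms alpha_i/alpha_0 and
# -alpha_i*alpha_j / (alpha_0**2 * (alpha_0 + 1)).
print(m1, alphas[0] / a0)
print(cov12, -alphas[0] * alphas[1] / (a0 ** 2 * (a0 + 1)))
```

The agreement is only approximate, since these are Monte Carlo estimates; increasing the number of draws tightens it.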

Multinomial 

Symbol: $Mult_k(n, p_1, \ldots, p_k)$

Density: $f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \binom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k}$
Dominating measure: Counting measure on

$\{(x_1, \ldots, x_k) : \text{all } x_i \in \{0, \ldots, n\} \text{ and } x_1 + \cdots + x_k = n\}$

Mean: $E(X_i) = np_i$

Variance: $Var(X_i) = np_i(1 - p_i)$

Covariance: $Cov(X_i, X_j) = -np_ip_j$
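
For small $n$ and $k$ the multinomial covariance formula can be confirmed by enumerating the whole support. This is a minimal sketch with exact fractions; the helper names `mult_pmf` and `mult_cov` are hypothetical, and brute-force enumeration is only practical for small cases.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def mult_pmf(x, p):
    """Multinomial probability of the count vector x with cell probabilities p."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)   # exact: each partial quotient is an integer
    prob = Fraction(coef)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob

def mult_cov(n, p, i, j):
    """Cov(X_i, X_j) computed by summing over every count vector with total n."""
    e_i = e_j = e_ij = Fraction(0)
    for x in product(range(n + 1), repeat=len(p)):
        if sum(x) != n:
            continue
        w = mult_pmf(x, p)
        e_i += w * x[i]
        e_j += w * x[j]
        e_ij += w * x[i] * x[j]
    return e_ij - e_i * e_j

n = 4
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
# Off-diagonal case against -n*p_i*p_j, diagonal case against n*p_i*(1 - p_i).
print(mult_cov(n, p, 0, 1), -n * p[0] * p[1])
print(mult_cov(n, p, 0, 0), n * p[0] * (1 - p[0]))
```

Calling `mult_cov` with `i == j` returns the variance, so the same routine covers both the Variance and Covariance lines of the entry.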

Multivariate Normal 

Symbol: $N_p(\mu, \sigma)$

Density: $f_X(x) = (2\pi)^{-\frac{p}{2}} |\sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x-\mu)^T \sigma^{-1} (x-\mu)\right)$
Dominating measure: Lebesgue measure on $\mathbb{R}^p$
Mean: $E(X_i) = \mu_i$
Variance: $Var(X_i) = \sigma_{i,i}$
Covariance: $Cov(X_i, X_j) = \sigma_{i,j}$



References 



Ahlfors, L. (1966). Complex Analysis (2nd ed.). New York: McGraw-Hill. 

Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. 
Cambridge: Cambridge University Press. 

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychoto- 
mous response data. Journal of the American Statistical Association, 88, 
669-679. 

Aldous, D. J. (1981). Representations for partially exchangeable random vari- 
ables. Journal of Multivariate Analysis, 11, 581-598. 

Aldous, D. J. (1985). Exchangeability and related topics. In P. L. Hennequin 
(Ed.), Ecole d'Ete de Probabilites de Saint-Flour XIII-1983 (pp. 1-198). 
Berlin: Springer-Verlag.

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis 
(2nd ed.). New York: Wiley. 

Anscombe, F. J. and Aumann, R. J. (1963). A definition of subjective proba- 
bility. Annals of Mathematical Statistics, 34, 199-205. 

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to 
Bayesian nonparametric problems. Annals of Statistics, 2, 1152-1174. 

Bahadur, R. R. (1957). On unbiased estimates of uniformly minimum variance. 
Sankhya, 18, 211-224. 

Barnard, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott. 
Journal of the Royal Statistical Society (Series B), 32, 194-195. 

Barnard, G. A. (1976). Conditional inference is not inefficient. Scandinavian 
Journal of Statistics, 3, 132-134. 

Barndorff-Nielsen, O. E. (1988). Parametric Statistical Models and Likelihood.
Berlin: Springer-Verlag.

Barnett, V. (1982). Comparative Statistical Inference (2nd ed.). New York: 
Wiley. 

Barron, A. R. (1986). Discussion of "On the consistency of Bayes estimates"
by Diaconis and Freedman. Annals of Statistics, 14, 26-30. 

Barron, A. R. (1988). The exponential convergence of posterior probabilities 
with implications for Bayes estimators of density functions. Technical Re- 
port 7, Department of Statistics, University of Illinois, Champaign, IL. 

Basu, D. (1955). On statistics independent of a complete sufficient statistic.
Sankhya, 15, 377-380. 

Basu, D. (1958). On statistics independent of sufficient statistics. Sankhya, 20, 
223-226. 






Bayes, T. (1764). An essay toward solving a problem in the doctrine of chances. 
Philosophical Transactions of the Royal Society of London, 53, 370-418. 

Becker, R. A., Chambers, J. M., and Wilks, A. R. (1988). The New S 
Language: A Programming Environment for Data Analysis and Graphics. 
Pacific Grove, CA: Wadsworth and Brooks/Cole. 

Berberian, S. K. (1961). Introduction to Hilbert Space. New York: Oxford 
University Press. 

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd 
ed.). New York: Springer-Verlag.

Berger, J. O. (1994). An overview of robust Bayesian analysis (with discussion). 
Test, 3, 5-124. 

Berger, J. O. and Berry, D. A. (1988). The relevance of stopping rules in 
statistical inference (with discussion). In S. S. Gupta and J. O. Berger 
(Eds.), Statistical Decision Theory and Related Topics IV (pp. 29-72). New
York: Springer-Verlag.

Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: The 
irreconcilability of P values and evidence (with discussion). Journal of the 
American Statistical Association, 82, 112-122. 

Berk, R. H. (1966). Limiting behavior of posterior distributions when the model 
is incorrect. Annals of Mathematical Statistics, 37, 51-58. 

Berkson, J. (1942). Tests of significance considered as evidence. Journal of the 
American Statistical Association, 37, 325-335. 

Berti, P., Regazzini, E., and Rigo, P. (1991). Coherent statistical inference 
and Bayes theorem. Annals of Statistics, 19, 366-381. 

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the 
bootstrap. Annals of Statistics, 9, 1196-1217. 

Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley. 

Billingsley, P. (1986). Probability and Measure (2nd ed.). New York: Wiley. 

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete 
Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press. 

Blackwell, D. (1947). Conditional expectation and unbiased sequential
estimation. Annals of Mathematical Statistics, 18, 105-110.

Blackwell, D. (1973). Discreteness of Ferguson selections. Annals of Statistics, 
1, 356-358. 

Blackwell, D. and Dubins, L. (1962). Merging of opinions with increasing 
information. Annals of Mathematical Statistics, 33, 882-886. 

Blackwell, D. and Ramamoorthi, R. V. (1982). A Bayes but not classically 
sufficient statistic. Annals of Statistics, 10, 1025-1026. 

Blyth, C. R. (1951). On minimax statistical decision procedures and their 
admissibility. Annals of Mathematical Statistics, 22, 22-42. 

Bondar, J. V. (1988). Discussion of "Conditionally acceptable frequentist
solutions" by George Casella. In S. S. Gupta and J. O. Berger (Eds.),
Statistical Decision Theory and Related Topics IV (pp. 91-93). New York:
Springer-Verlag.






Bortkiewicz, L. V. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner. 

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations (with 
discussion). Journal of the Royal Statistical Society (Series B), 26, 211-246. 

Box, G. E. P. and Tiao, G. C. (1968). A Bayesian approach to some outlier
problems. Biometrika, 55, 119-129. 

Breiman, L. (1968). Probability. Reading, MA: Addison-Wesley.

Brenner, D., Fraser, D. A. S., and McDunnough, P. (1982). On asymp- 
totic normality of likelihood and conditional analysis. Canadian Journal of 
Statistics, 10, 163-172. 

Brown, L. D. (1967). The conditional level of Student's t test. Annals of 
Mathematical Statistics, 38, 1068-1071. 

Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble 
boundary value problems. Annals of Mathematical Statistics, 42, 855-903. 
(See also correction, Annals of Statistics, 1, 594-596.) 

Brown, L. D. and Hwang, J. T. (1982). A unified admissibility proof. In S. S.
Gupta and J. O. Berger (Eds.), Statistical Decision Theory and Related
Topics III (pp. 205-230). New York: Academic Press.

Buck, C. (1965). Real Analysis (2nd ed.). New York: McGraw-Hill.

Buehler, R. J. (1959). Some validity criteria for statistical inferences. Annals 
of Mathematical Statistics, 30, 845-863. 

Buehler, R. J. and Fedderson, A. P. (1963). Note on a conditional property
of Student's t. Annals of Mathematical Statistics, 34, 1098-1100.

Casella, G. and Berger, R. L. (1987). Reconciling Bayesian and frequentist
evidence in the one-sided testing problem (with discussion). Journal of the
American Statistical Association, 82, 106-111.

Chaloner, K., Church, T., Louis, T. A., and Matts, J. P. (1993). Graphical
elicitation of a prior distribution for a clinical trial. The Statistician, 42,
341-353.

Chang, T. and Villegas, C. (1986). On a theorem of Stein relating Bayesian
and classical inferences in group models. Canadian Journal of Statistics, 14,
289-296.

Chapman, D. and Robbins, H. (1951). Minimum variance estimation without 
regularity assumptions. Annals of Mathematical Statistics, 22, 581-586. 

Chen, C.-F. (1985). On asymptotic normality of limiting density functions with
Bayesian implications. Journal of the Royal Statistical Society (Series B),
47, 540-546.

Chow, Y. S., Robbins, H., and Siegmund, D. (1971). Great Expectations: The
Theory of Optimal Stopping. New York: Houghton Mifflin.

Churchill, R. V. (1960). Complex Variables and Applications (2nd ed.). New
York: McGraw-Hill.

Clarke, B. and Barron, A. R. (1994). Jeffreys' prior is asymptotically least
favorable under entropy risk. Journal of Statistical Planning and Inference, 
41, 37-60. 






Cornfield, J. (1966). A Bayesian test of some classical hypotheses — with ap- 
plications to sequential clinical trials. Journal of the American Statistical 
Association, 61, 577-594. 

Cox, D. R. (1958). Some problems connected with statistical inference. Annals 
of Mathematical Statistics, 29, 357-372. 

Cox, D. R. (1977). The role of significance tests. Scandinavian Journal of 
Statistics, 4, 49-70. 

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chap- 
man and Hall. 

Cramer, H. (1945). Mathematical Methods of Statistics. Princeton: Princeton 
University Press. 

Cramer, H. (1946). Contributions to the theory of statistical estimation. Skan- 
dinavisk Aktuarietidsk, 29, 85-94. 

David, H. A. (1970). Order Statistics. New York: Wiley. 

Dawid, A. P. (1970). On the limiting normality of posterior distributions. Pro- 
ceedings of the Cambridge Philosophical Society, 67, 625-633. 

Dawid, A. P. (1982). Intersubjective statistical models. In G. Koch and 
F. Spizzichino (Eds.), Exchangeability in Probability and Statistics (pp. 
217-232). Amsterdam: North-Holland. 

Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of 
the Royal Statistical Society (Series A), 147, 278-292. 

Dawid, A. P., Stone, M., and Zidek, J. V. (1973). Marginalization paradoxes 
in Bayesian and structural inference. Journal of the Royal Statistical Society 
(Series B), 35, 189-233. 

DeFinetti, B. (1937). Foresight: Its logical laws, its subjective sources. In H. E. 
Kyburg and H. E. Smokler (Eds.), Studies in Subjective Probability (pp.
53-118). New York: Wiley. 

DeFinetti, B. (1974). Theory of Probability, Vols. I and II. New York: Wiley. 

DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: Wiley. 

DeMoivre, A. (1756). The Doctrine of Chance (3rd ed.). London: A. Millar. 

Diaconis, P. and Freedman, D. A. (1980a). Finite exchangeable sequences.
Annals of Probability, 8, 745-764. 

Diaconis, P. and Freedman, D. A. (1980b). DeFinetti's generalizations of 
exchangeability. In R. C. Jeffrey (Ed.), Studies in Inductive Logic and 
Probability, II (pp. 233-249). Berkeley: University of California. 

Diaconis, P. and Freedman, D. A. (1980c). DeFinetti's theorem for Markov 
chains. Annals of Probability, 8, 115-130. 

Diaconis, P. and Freedman, D. A. (1984). Partial exchangeability and suf- 
ficiency. In J. K. Ghosh and J. Roy (Eds.), Statistics: Applications and 
New Directions (pp. 205-236). Calcutta: Indian Statistical Institute. 

Diaconis, P. and Freedman, D. A. (1986a). On the consistency of Bayes 
estimates (with discussion). Annals of Statistics, 14, 1-26. 






Diaconis, P. and Freedman, D. A. (1986b). On inconsistent Bayes estimates
of location. Annals of Statistics, 14, 68-87. 

Diaconis, P. and Freedman, D. A. (1990). Cauchy's equation and DeFinetti's 
theorem. Scandinavian Journal of Statistics, 17, 235-250. 

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential fam- 
ilies. Annals of Statistics, 7, 269-281. 

Dickey, J. M. (1980). Beliefs about beliefs, a theory of stochastic assessments
of subjective probabilities. In J. M. Bernardo, M. H. DeGroot, D. V. 
Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics (pp. 471-487). 
Valencia, Spain: University Press. 

Doob, J. L. (1949). Application of the theory of martingales. In Le Calcul des
Probabilités et ses Applications (pp. 23-27). Paris: Colloques Internationaux
du Centre National de la Recherche Scientifique.

Doob, J. L. (1953). Stochastic Processes. New York: Wiley. 

Dubins, L. E. and Freedman, D. A. (1963). Random distribution functions.
Bulletin of the American Mathematical Society, 69, 548-551.

Dugundji, J. (1966). Topology. Boston: Allyn and Bacon.

Dunford, N. and Schwartz, J. T. (1957). Linear Operators, Part I: General 
Theory. New York: Interscience. 

Dunford, N. and Schwartz, J. T. (1963). Linear Operators, Part II: Spectral 
Theory. New York: Interscience. 

Eberhardt, K. R., Mee, R. W., and Reeve, C. P. (1989). Computing factors
for exact two-sided tolerance limits for a normal distribution.
Communications in Statistics - Simulation and Computation, 18, 397-413.

Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical 
inference for psychological research. Psychological Review, 70, 193-242. 

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of 
Statistics, 7, 1-26. 

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. 
Philadelphia: Society for Industrial and Applied Mathematics. 

Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information.
Biometrika, 65, 457-487.

Efron, B. and Morris, C. N. (1975). Data analysis using Stein's estimator
and its generalizations. Journal of the American Statistical Association, 70,
311-319.

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. 
London: Chapman and Hall. 

Escobar, M. D. (1988). Estimating the Means of Several Normal Populations
by Nonparametric Estimation of the Distribution of the Means. Ph.D. thesis, 
Yale University. 

Fabius, J. (1964). Asymptotic behavior of Bayes' estimates. Annals of Mathe- 
matical Statistics, 35, 846-856. 






Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Ap- 
proach. New York: Academic Press. 

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. 
Annals of Statistics, 1, 209-230. 

Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. 
Annals of Statistics, 2, 615-629. 

Ferrandiz, J. R. (1985). Bayesian inference on Mahalanobis distance: An al- 
ternative to Bayesian model testing. In J. M. Bernardo, M. H. DeG- 
root, D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 
2: Proceedings of the Second Valencia International Meeting (pp. 645-653). 
Amsterdam: North Holland. 

Fieller, E. C. (1954). Some problems in interval estimation. Journal of the 
Royal Statistical Society (Series B), 16, 175-185. 

Fishburn, P. C. (1970). Utility Theory for Decision Making. New York: Wiley. 

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics.
Philosophical Transactions of the Royal Society of London, Series A, 222A,
309-368.

Fisher, R. A. (1924). The conditions under which $\chi^2$ measures the discrepancy
between observation and hypothesis. Journal of the Royal Statistical Society,
87, 442-450.

Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the
Cambridge Philosophical Society, 22, 700-725.

Fisher, R. A. (1934). Two new properties of mathematical likelihood.
Proceedings of the Royal Society of London, A, 144, 285-307.

Fisher, R. A. (1935). The fiducial argument in statistical inference. Annals of
Eugenics, 6, 391-398. 

Fisher, R. A. (1936). Has Mendel's work been rediscovered? Annals of Science,
1, 115-137. 

Fisher, R. A. (1943). Note on Dr. Berkson's criticism of tests of significance. 
Journal of the American Statistical Association, 38, 103-104. 

Fisher, R. A. (1966). The Design of Experiments (8th ed.). New York: Hafner.

Fraser, D. A. S. and McDunnough, P. (1984). Further remarks on asymp- 
totic normality of likelihood and conditional analyses. Canadian Journal of 
Statistics, 12, 183-190. 

Freedman, D. A. (1963). On the asymptotic behavior of Bayes' estimates in 
the discrete case. Annals of Mathematical Statistics, 34, 1386-1403. 

Freedman, D. A. (1977). A remark on the difference between sampling with 
and without replacement. Journal of the American Statistical Association, 
72, 681. 

Freedman, D. A. and Diaconis, P. (1982). On inconsistent M-estimators. 
Annals of Statistics, 10, 454-461. 

Freedman, L. S. and Spiegelhalter, D. J. (1983). The assessment of subjec- 
tive opinion and its use in relation to stopping rules of clinical trials. The 
Statistician, 32, 153-160. 






Freeman, P. R. (1980). On the number of outliers in data from a linear model. In
J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith 
(Eds.), Bayesian Statistics (pp. 349-365). Valencia, Spain: University Press. 

Gabriel, K. R. (1969). Simultaneous test procedures — some theory of multiple 
comparisons. Annals of Mathematical Statistics, 40, 224-250. 

Garthwaite, P. and Dickey, J. (1988). Quantifying expert opinion in linear 
regression problems. Journal of the Royal Statistical Society (Series B), 50, 
462-474. 

Garthwaite, P. H. and Dickey, J. M. (1992). Elicitation of prior distributions 
for variable-selection problems in regression. Annals of Statistics, 20, 1697- 
1719. 

Gavasakar, U. K. (1984). A Study of Elicitation Procedures by Modelling the
Errors in Responses. Ph.D. thesis, Carnegie Mellon University. 

Geisser, S. (1967). Estimation associated with linear discriminants. Annals of 
Mathematical Statistics, 38, 807-817. 

Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection. 
Journal of the American Statistical Association, 74, 153-160. 

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to cal- 
culating marginal densities. Journal of the American Statistical Association, 
85, 398-409. 

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions 
and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 6, 721-741. 

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate 
Observations. New York: Wiley. 

Good, I. J. (1956). Discussion of "Chance and control: Some implications of
randomization" by G. Spencer Brown. In C. Cherry (Ed.), Information 
Theory: Third London Symposium (pp. 13-14). London: Butterworths. 

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer- 
Verlag. 

Hall, W. J., Wijsman, R. A., and Ghosh, J. K. (1965). The relationship
between sufficiency and invariance with applications in sequential analysis. 
Annals of Mathematical Statistics, 36, 575-614. 

Halmos, P. R. (1950). Measure Theory. New York: Van Nostrand. 

Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym 
theorem to the theory of sufficient statistics. Annals of Mathematical Statis- 
tics, 20, 225-241. 

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. 
(1986). Robust Statistics: The Approach Based on Influence Functions. New 
York: Wiley. 

Hartigan, J. (1983). Bayes Theory. New York: Springer-Verlag.

Heath, D. and Sudderth, W. D. (1976). DeFinetti's theorem on exchangeable
variables. American Statistician, 30, 188-189. 






Heath, D. and Sudderth, W. D. (1989). Coherent inference from improper 
priors and from finitely additive priors. Annals of Statistics, 17, 907-919. 

Hewitt, E. and Savage, L. J. (1955). Symmetric measures on cartesian prod- 
ucts. Transactions of the American Mathematical Society, 80, 470-501. 

Heyde, C. C. and Johnstone, I. M. (1979). On asymptotic posterior normality 
for stochastic processes. Journal of the Royal Statistical Society (Series B), 
41, 184-189. 

Hill, B. M. (1965). Inference about variance components in the one-way model. 
Journal of the American Statistical Association, 60, 806-825. 

Hill, B. M., Lane, D., and Sudderth, W. D. (1987). Exchangeable urn pro- 
cesses. Annals of Probability, 15, 1586-1592. 

Hoel, P. G., Port, S. C., and Stone, C. J. (1971). Introduction to Probability
Theory. Boston: Houghton Mifflin. 

Hogarth, R. M. (1975). Cognitive processes and the assessment of subjective 
probability distributions (with discussion). Journal of the American Statis- 
tical Association, 70, 271-294. 

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of 
Mathematical Statistics, 35, 73-101. 

Huber, P. J. (1967). The behaviour of maximum likelihood estimates under 
nonstandard conditions. In L. M. LeCam and J. Neyman (Eds.), Pro- 
ceedings of the Fifth Berkeley Symposium on Mathematical Statistics and 
Probability, volume 1 (pp. 221-233). Berkeley: University of California. 

Huber, P. J. (1977). Robust Statistical Procedures. Philadelphia: Society for
Industrial and Applied Mathematics. 

Huber, P. J. (1981). Robust Statistics. New York: Wiley. 

James, W. and Stein, C. M. (1960). Estimation with quadratic loss. In
J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 361-379). Berkeley: University of 
California. 

Jaynes, E. T. (1976). Confidence intervals vs. Bayesian intervals (with
discussion). In W. L. Harper and C. A. Hooker (Eds.), Foundations of
Probability Theory, Statistical Inference, and Statistical Theories of Science 
(pp. 175-257). Dordrecht: D. Reidel. 

Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford: Oxford University 
Press. 

Johnstone, I. M. (1978). Problems in limit theory for martingales and posterior 
distributions from stochastic processes. Master's thesis, Australian National 
University. 

Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W., and Peters, 
S. C. (1980). Interactive elicitation of opinion for a normal linear model. 
Journal of the American Statistical Association, 75, 845-854. 

Kadane, J. B., Schervish, M. J., and Seidenfeld, T. (1985). Statistical
implications of finitely additive probability. In P. Goel and A. Zellner
(Eds.), Bayesian Inference and Decision Techniques with Applications: Essays
in Honor of Bruno DeFinetti (pp. 59-76). Amsterdam: Elsevier Science
Publishers.

Kadane, J. B., Schervish, M. J., and Seidenfeld, T. (1996). Reasoning to a 
foregone conclusion. Journal of the American Statistical Association, 91, to 
appear. 

Kagan, A. M., Linnik, Y. V., and Rao, C. R. (1965). On a characterization of
the normal law based on a property of the sample average. Sankhya, Series 
A, 32, 37-40. 

Kahneman, D., Slovic, P., and Tversky, A. (Eds.) (1982). Judgment Under
Uncertainty: Heuristics and Biases. Cambridge: Cambridge University
Press. 

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American
Statistical Association, 90, 773-795. 

Kass, R. E. and Steffey, D. (1989). Approximate Bayesian inference in
conditionally independent hierarchical models (parametric empirical Bayes
models). Journal of the American Statistical Association, 84, 717-726.

Kass, R. E., Tierney, L., and Kadane, J. B. (1988). Asymptotics in Bayesian 
computation. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, 
and A. F. M. Smith (Eds.), Bayesian Statistics 3 (pp. 261-278). Oxford:
Clarendon Press. 

Kass, R. E., Tierney, L., and Kadane, J. B. (1990). The validity of posterior 
expansions based on Laplace's method. In S. Geisser, J. S. Hodges, S. J. 
Press, and A. Zellner (Eds.), Bayesian and Likelihood Methods in Statis-
tics and Econometrics (pp. 473-488). Amsterdam: Elsevier (North Holland).

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood
estimator in the presence of infinitely many incidental parameters. Annals
of Mathematical Statistics, 27, 887-906. 

Kerridge, D. (1963). Bounds for the frequency of misleading Bayes inferences. 
Annals of Mathematical Statistics, 34, 1109-1110. 

Kinderman, A. J. and Monahan, J. F. (1977). Computer generation of ran- 
dom variables using the ratio of uniform deviates. ACM Transactions on 
Mathematical Software, 3, 257-260. 

Kingman, J. F. C. (1978). Uses of exchangeability. Annals of Probability, 6,
183-197.

Knuth, D. E. (1984). The TeXbook. Reading, MA: Addison-Wesley.

Kraft, C. H. (1964). A class of distribution function processes which have
derivatives. Journal of Applied Probability, 1, 385-388.

Krasker, W. and Pratt, J. W. (1986). Discussion of "On the consistency
of Bayes estimates" by Diaconis and Freedman. Annals of Statistics, 14,
55-58.

Krem, A. (1963). On the independence in the limit of extreme and central
order statistics. Publications of the Mathematical Institute of the Hungarian 
Academy of Science, 8, 469-474. 



Kshirsagar, A. M. (1972). Multivariate Analysis. New York: Marcel Dekker.

Kullback, S. (1959). Information Theory and Statistics. New York: Wiley. 

Lamport, L. (1986). LaTeX: A Document Preparation System. Reading, MA:
Addison-Wesley.

Lauritzen, S. L. (1984). Extreme point models in statistics (with discussion). 
Scandinavian Journal of Statistics, 11, 65-91. 

Lauritzen, S. L. (1988). Extremal Families and Systems of Sufficient Statistics. 
Berlin: Springer-Verlag.

Lavine, M. (1992). Some aspects of Polya tree distributions for statistical mod- 
elling. Annals of Statistics, 20, 1222-1235. 

Lavine, M., Wasserman, L., and Wolpert, R. L. (1991). Bayesian inference 
with specified prior marginals. Journal of the American Statistical Associa- 
tion, 86, 964-971. 

Lavine, M., Wasserman, L., and Wolpert, R. L. (1993). Linearization of 
Bayesian robustness problems. Journal of Statistical Planning and Inference, 
37, 307-316.

LeCam, L. M. (1953). On some asymptotic properties of maximum likelihood 
estimates and related Bayes estimates. University of California Publications 
in Statistics, 1, 277-330. 

LeCam, L. M. (1970). On the assumptions used to prove asymptotic normality 
of maximum likelihood estimates. Annals of Mathematical Statistics, 41, 
802-828. 

Lecoutre, B. (1985). Reconsideration of the F-test of the analysis of variance:
The semi-Bayesian significance test. Communications in Statistics - Theory
and Methods, 14, 2437-2446.

Lecoutre, B. and Rouanet, H. (1981). Deux structures statistiques fonda-
mentales en analyse de la variance univariee et multivariee. Mathematiques
et Sciences Humaines, 75, 71-82.

Lehmann, E. L. (1958). Significance level and power. Annals of Mathematical
Statistics, 29, 1167-1176.

Lehmann, E. L. (1983). Theory of Point Estimation. New York: Wiley.

Lehmann, E. L. (1986). Testing Statistical Hypotheses (2nd ed.). New York:
Wiley.

Lehmann, E. L. and Scheffe, H. (1955). Completeness, similar regions and 
unbiased estimates. Sankhya, 10, 305-340. (Also 15, 219-236, and correction 
17, 250.) 

Lindley, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.

Lindley, D. V. and Novick, M. R. (1981). The role of exchangeability in
inference. Annals of Statistics, 9, 45-58.

Lindley, D. V. and Phillips, L. D. (1976). Inference for a Bernoulli process
(a Bayesian view). American Statistician, 30, 112-119.

Lindley, D. V. and Smith, A. F. M. (1972). Bayes estimates for the linear
model. Journal of the Royal Statistical Society (Series B), 34, 1-41.



Loeve, M. (1977). Probability Theory I (4th ed.). New York: Springer-Verlag.

Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992). Polya trees 
and random distributions. Annals of Statistics, 20, 1203-1221. 

Mauldin, R. D. and Williams, S. C. (1990). Reinforced random walks and 
random distributions. Proceedings of the American Mathematical Society, 
110, 251-258. 

Mendel, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen
Naturforschender Vereines in Brünn, 10, 1.

Metivier, M. (1971). Sur la construction de mesures aleatoires presque surement 
absolument continues par rapport a une mesure donnee. Zeitschrift fur 
Wahrscheinlichkeitstheorie, 20, 332-344. 

Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and appli- 
cations (with discussion). Journal of the American Statistical Association, 
78, 47-65. 

Nachbin, L. (1965). The Haar Integral. Princeton: Van Nostrand. 

Neyman, J. (1935). Su un teorema concernente le cosiddette statistiche suffici-
enti. Giornale dell'Istituto Italiano degli Attuari, 6, 320-334.

Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient test 
of statistical hypotheses. Philosophical Transactions of the Royal Society of 
London, Series A, 231, 289-337. 

Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially 
consistent observations. Econometrica, 16, 1-32. 

Pearson, K. (1900). On the criterion that a given system of deviations from the 
probable in the case of a correlated system of variables is such that it can 
be reasonably supposed to have arisen from random sampling. Philosoph-
ical Magazine (5th Series), 50, 339-357. (See also correction, Philosophical
Magazine (6th Series), 1, 670-671.)

Perlman, M. (1972). On the strong consistency of approximate maximum like-
lihood estimators. In L. M. LeCam, J. Neyman, and E. L. Scott (Eds.),
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and 
Probability, volume 1 (pp. 263-281). Berkeley: University of California. 

Pierce, D. A. (1973). On some difficulties in a frequency theory of inference. 
Annals of Statistics, 1, 241-250. 

Pitman, E. (1939). The estimation of location and scale parameters of a contin- 
uous population of any given form. Biometrika, 30, 391-421. 

Pratt, J. W. (1961). Review of "Testing Statistical Hypotheses" by E. L.
Lehmann. Journal of the American Statistical Association, 56, 163-167. 

Pratt, J. W. (1962). Discussion of "On the foundations of statistical inference"
by Allan Birnbaum. Journal of the American Statistical Association, 57, 
314-316. 

Rao, C. R. (1945). Information and the accuracy attainable in the estimation 
of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 
81-91. 



Rao, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.). 
New York: Wiley. 

Robbins, H. (1951). Asymptotically subminimax solutions of compound sta- 
tistical decision problems. In J. Neyman (Ed.), Proceedings of the Second 
Berkeley Symposium on Mathematical Statistics and Probability (pp. 131-148).
Berkeley: University of California.

Robbins, H. (1955). An empirical Bayes approach to statistics. In J. Ney- 
man (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical 
Statistics and Probability, volume 1 (pp. 157-164). Berkeley: University of 
California. 

Robbins, H. (1964). The empirical Bayes approach to statistical decision prob- 
lems. Annals of Mathematical Statistics, 35, 1-20. 

Robert, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3,
601-608. 

Roberts, H. V. (1967). Informative stopping rules and inferences about popu- 
lation size. Journal of the American Statistical Association, 62, 763-775. 

Rouanet, H. and Lecoutre, B. (1983). Specific inference in ANOVA: From 
significance tests to Bayesian procedures. British Journal of Mathematical 
and Statistical Psychology, 36, 252-268. 

Royden, H. L. (1968). Real Analysis. London: Macmillan. 

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134. 

Rudin, W. (1964). Principles of Mathematical Analysis (2nd ed.). New York:
McGraw-Hill. 

Savage, L. J. (1954). The Foundations of Statistics. New York: Wiley. 
Savage, L. J. (1962). The Foundations of Statistical Inference. London: 
Methuen. 

Scheffe, H. (1947). A useful convergence theorem for probability distributions. 
Annals of Mathematical Statistics, 18, 434-438. 

Schervish, M. J. (1983). User-oriented inference. Journal of the American 
Statistical Association, 78, 611-615. 

Schervish, M. J. (1992). Bayesian analysis of linear models (with discussion). 
In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith 
(Eds.), Bayesian Statistics 4: Proceedings of the Second Valencia Interna- 
tional Meeting (pp. 419-434). Oxford: Clarendon Press. 

Schervish, M. J. (1994). Discussion of "Bootstrap: More than a stab in the 
dark?" by G. A. Young. Statistical Science, 9, 408-410. 

Schervish, M. J. (1996). P-values: What they are and what they are not. 
American Statistician, 50, to appear. 

Schervish, M. J. and Carlin, B. P. (1992). On the convergence of successive
substitution sampling. Journal of Computational and Graphical Statistics, 
1, 111-127. 

Schervish, M. J. and Seidenfeld, T. (1990). An approach to consensus and 
certainty with increasing evidence. Journal of Statistical Planning and In- 
ference, 25, 401-414. 



Schervish, M. J., Seidenfeld, T., and Kadane, J. B. (1984). The extent 
of non-conglomerability of finitely additive probabilities. Zeitschrift fur 
Wahrscheinlichkeitstheorie, 66, 205-226. 

Schervish, M. J., Seidenfeld, T., and Kadane, J. B. (1990). State dependent 
utilities. Journal of the American Statistical Association, 85, 840-847. 

Schwartz, L. (1965). On Bayes procedures. Zeitschrift fur Wahrscheinlichkeit- 
stheorie, 4, 10-26. 

Seidenfeld, T. and Schervish, M. J. (1983). A conflict between finite addi- 
tivity and avoiding Dutch Book. Philosophy of Science, 50, 398-412. 

Seidenfeld, T., Schervish, M. J., and Kadane, J. B. (1995). A representation 
of partially ordered preferences. Annals of Statistics, 23, 2168-2217. 

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. 
New York: Wiley. 

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica 
Sinica, 4, 639-650. 

Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Annals of 
Statistics, 9, 1187-1195. 

Smith, A. F. M. (1973). A general Bayesian linear model. Journal of the Royal
Statistical Society, Ser. B, 35, 67-75. 

Spjøtvoll, E. (1983). Preference functions. In P. J. Bickel, K. Doksum,
and J. L. Hodges, Jr. (Eds.), A Festschrift for Erich L. Lehmann (pp. 
409-432). Belmont, CA: Wadsworth. 

StatSci (1992). S-PLUS, Version 3.1 (software package). Seattle: StatSci Divi- 
sion, MathSoft, Inc. 

Stein, C. M. (1946). A note on cumulative sums. Annals of Mathematical 
Statistics, 17, 498-499. 

Stein, C. M. (1956). Inadmissibility of the usual estimator for the mean of 
a multivariate normal distribution. In J. Neyman (Ed.), Proceedings of 
the Third Berkeley Symposium on Mathematical Statistics and Probability, 
volume 1 (pp. 197-206). Berkeley: University of California. 

Stein, C. M. (1965). Approximation of improper prior measures by prior proba-
bility measures. In J. Neyman and L. M. LeCam (Eds.), Bernoulli, Bayes,
Laplace: Anniversary Volume (pp. 217-240). New York: Springer-Verlag.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribu-
tion. Annals of Statistics, 9, 1135-1151.

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncer-
tainty before 1900. Cambridge, MA: Belknap.

Stone, M. (1976). Strong inconsistency from uniform priors. Journal of the 
American Statistical Association, 71, 114-125. 

Stone, M. and Dawid, A. P. (1972). Un-Bayesian implications of improper
Bayes inference in routine statistical problems. Biometrika, 59, 369-375. 



Strasser, H. (1981). Consistency of maximum likelihood and Bayes estimates. 
Annals of Statistics, 9, 1107-1113. 

Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multi- 
variate normal mean. Annals of Mathematical Statistics, 42, 385-388. 

Taylor, R. L., Daffer, P. Z., and Patterson, R. F. (1985). Limit Theorems 
for Sums of Exchangeable Random Variables. Totowa, NJ: Rowman and 
Allanheld. 

Tierney, L. (1994). Markov chains for exploring posterior distributions (with 
discussion). Annals of Statistics, 22, 1701-1762. 

Tierney, L. and Kadane, J. B. (1986). Accurate approximations for poste- 
rior moments and marginal densities. Journal of the American Statistical 
Association, 81, 82-86. 

Venn, J. (1876). The Logic of Chance (2nd ed.). London: Macmillan. 

Verdinelli, I. and Wasserman, L. (1991). Bayesian analysis of outlier problems 
using the Gibbs sampler. Statistics and Computing, 1, 105-117. 

Von Mises, R. (1957). Probability, Statistics and Truth. London: Allen and 
Unwin. 

Von Neumann, J. and Morgenstern, O. (1947). Theory of Games and Eco- 
nomic Behavior (2nd ed.). Princeton: Princeton University Press. 

Wald, A. (1947). Sequential Analysis. New York: Wiley. 

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. 
Annals of Mathematical Statistics, 20, 595-601. 

Wald, A. and Wolfowitz, J. (1948). Optimum character of the sequential 
probability ratio test. Annals of Mathematical Statistics, 19, 326-339. 

Walker, A. M. (1969). On the asymptotic behaviour of posterior distributions.
Journal of the Royal Statistical Society (Series B), 31, 80-88. 

Wallace, D. L. (1959). Conditional confidence level properties. Annals of
Mathematical Statistics, 30, 864-876. 

Welch, B. L. (1939). On confidence limits and sufficiency, with particular ref- 
erence to parameters of location. Annals of Mathematical Statistics, 10, 
58-69. 

West, M. (1984). Outlier models and prior distributions in Bayesian linear
regression. Journal of the Royal Statistical Society (Series B), 46, 431-439. 

Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits.
Annals of Mathematical Statistics, 12, 91-96. 

Young, G. A. (1994). Bootstrap: More than a stab in the dark? (with discus- 
sion). Statistical Science, 9, 382-415. 

Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics.
New York: Wiley. 



Notation and Abbreviation Index 



0 (vector of 0s), 385
1 (vector of 1s), 345
2^S (power set), 571
≪ (absolutely continuous), 574, 597
a.e. (almost everywhere), 582
ANCB(·,·,·) (distribution), 668
ANCχ²(·,·) (distribution), 668
ANCF(·,·,·) (distribution), 669
ANOVA (analysis of variance), 384
ARE (asymptotic relative efficiency), 413

a.s. (almost surely), 582 

𝒜_X (σ-field generated by X), 51, 82
ℵ (action space), 144
α (action space σ-field), 144

\ (remove one set from another), 577 
B̄ (closure of set), 622
Ber(·) (distribution), 672
Beta(·,·) (distribution), 669
Bin(·,·) (distribution), 673
ℬ^k (Borel σ-field), 576
ℬ (Borel σ-field), 575

Cau(·,·) (distribution), 669
CDF (cumulative distribution function), 612
c'_g (constant related to RHM), 367
c_g (constant related to LHM), 367
χ² (distribution), 669
→^D (converges in distribution), 635
→^P (converges in probability), 638
→^W (converges weakly), 635
Cov_θ(·,·) (conditional covariance given Θ = θ), 19
Cov (covariance), 613
𝒞_𝒫 (σ-field on set of probability measures), 27
A^C (complement of set), 575

Dir(·) (Dirichlet process), 54



Dir_k(·,...,·) (distribution), 674
dμ₂/dμ₁ (Radon-Nikodym derivative), 575, 598
Δ (symmetric difference), 581

E_θ(·) (conditional mean given Θ = θ), 19
Exp(·) (distribution), 670
E(·) (expected value), 607, 613
E(·|·) (conditional mean), 616

f⁺ (positive part), 588
f⁻ (negative part), 588
f_{X|Θ} (conditional density of X given Θ), 13
f_{X|Y} (conditional density), 13
F_{q,a} (distribution), 670

Γ⁻¹(·,·) (distribution), 670
Γ(·,·) (distribution), 670
Geo(·) (distribution), 673

HPD (highest posterior density), 327 
Hyp(·,·,·) (distribution), 673

IID (independent and identically distributed), 611
I_k (identity matrix), 643
𝓘_{X|T}(·;·|·) (conditional Kullback-Leibler information), 115
𝓘_{X|T}(·|·) (conditional Fisher information), 111
𝓘_X(·) (Fisher information), 111
𝓘_X(·;·) (Kullback-Leibler information), 115
I_A(·) (indicator function), 9

Lap(·,·) (distribution), 670
λ_g (measure constructed from LHM), 367

LHM (left Haar measure), 363 
LMP (locally most powerful), 245 
LMVUE (locally minimum variance 
unbiased estimator), 300 



LR (likelihood ratio), 274 

MC (most cautious), 230 
MLE (maximum likelihood estimator), 307
MLR (monotone likelihood ratio), 239
MP (most powerful), 230
MRE (minimum risk equivariant), 347
Mult_k(·,...,·) (distribution), 674
μ_{Θ|X}(·|·) (posterior distribution), 16

NCB(·,·,·) (distribution), 671
NCχ²_q(·) (distribution), 671
NCF(·,·,·) (distribution), 671
NCt_a(·) (distribution), 671
NCT_a(·;·) (CDF of NCt distribution), 671
Negbin(·,·) (distribution), 673
ν (dominating measure), 13
N(·,·) (distribution), 671
N_p(·,·) (distribution), 674
M (integers plus ∞), 537

Ω (parameter space), 13, 82
o_P (stochastic small order), 396
O_P (stochastic large order), 396
o (small order), 394
O (large order), 394

𝒫_0 (parametric family), 50, 82
Par(·,·) (distribution), 672
∂B (boundary of set), 636
Φ(·) (CDF of normal distribution), 672
P_n (empirical probability measure), 12
Poi(·) (distribution), 673
Pr(·) (probability), 612
Pr(·|·) (conditional probability), 617
P_{θ,T}(·) (conditional distribution of T given Θ = θ), 84
P'_θ(·) (conditional probability given Θ = θ), 51, 83
P_θ(·) (conditional distribution given Θ = θ), 51, 83
P (random probability measure), 25
𝒫 (set of all probability measures), 27



Q_n(·|x) (conditional distribution given n observations), 539

ℝ (real numbers), 570
ℝ⁺ (positive reals), 571
ℝ⁺⁰ (nonnegative reals), 627
ρ_g (measure constructed from RHM), 367
RHM (right Haar measure), 363
r(η,δ) (Bayes risk), 149
R(θ,δ) (risk function), 149

(S, 𝒜, μ) (measure space), 577
SPRT (sequential probability ratio test), 549
SSS (successive substitution sampling), 507

τ (parameter space σ-field), 13
Θ' (parametric index), 50
Θ (parameter), 51
T (statistic), 84
T_a(·) (CDF of t distribution), 672
t_a(·,·) (distribution), 672
v^T (transpose of vector), 614

UMA (uniformly most accurate), 317 
UMAU (uniformly most accurate unbiased), 321
UMC (uniformly most cautious), 230
UMCU (uniformly most cautious unbiased), 254
UMP (uniformly most powerful), 230
UMPU (uniformly most powerful unbiased), 254
UMPUAI (uniformly most powerful unbiased almost invariant), 384
UMVUE (uniformly minimum variance unbiased estimator), 297
USC (upper semicontinuous), 417
U(·,·) (distribution), 672

Var_θ(·) (conditional variance given Θ = θ), 19
Var (variance), 613 

x (element of sample space), 82 
𝒳 (sample space), 13, 82



Name Index 



Ahlfors, L., 667, 675 
Aitchison, J., 325, 675 
Albert, J., 519, 675 
Aldous, D., 46, 79, 482, 675 
Anderson, T., 386, 675 
Andrews, C., ix
Anscombe, F., 181, 675
Antoniak, C., 59, 675
Aumann, R., 181, 675 

Bahadur, R., 94, 675 

Barnard, G., 320, 420, 675 

Barndorff-Nielsen, O., 307, 675 

Barnett, V., vii, 675 

Barron, A., 434-435, 446, 675, 677 

Basu, D., 99-100, 675

Bayes, T., 16, 29, 676 

Becker, R., x, 676 

Berberian, S., 507, 667, 676 

Berger, J., 22, 173, 284, 525, 565, 614, 666, 676
Berger, R., 283, 677 
Berk, R., 417, 430, 432, 676 
Berkson, J., 218, 281, 676 
Berry, D., 565, 676 
Berti, P., 21, 676 
Bhattacharyya, A., 305 
Bickel, P., 330-331, 676 
Billingsley, P., 46, 621, 636, 648, 676
Bishop, Y., 462, 676 
Blackwell, D., 56, 86, 152, 455, 676 
Blyth, C., 158, 676
Bohrer, R., x 
Bondar, J., 236, 676 
Bortkiewicz, L., 462, 677 
Box, G., 21, 521, 677 
Breiman, L., 618, 640, 677 
Brenner, D., 435, 677 
Brown, L., 99, 160, 167, 677 
Buck, C., 665, 677 
Buehler, R., 99, 677 

Carlin, B., 507, 686 
Casella, G., 283, 677 



Chaloner, K., 24, 677 

Chambers, J., x, 676 

Chang, T., 379, 677 

Chapman, D., 303, 677 

Chen, C., 435, 677

Chib, S., 519, 675 

Chow, Y., 647, 677 

Church, T., 24, 677 

Churchill, R., 666-667, 677 

Clarke, B., 446, 677 

Cornfield, J., 563, 565, 678 

Cox, D., 21, 218, 424, 521, 677-678 

Cramer, H., 301, 678 

Daffer, P., 33, 688 
David, H., 404, 678 
Dawid, A., 21, 125, 435, 521, 678, 
687 

DeFinetti, B., ix, 6, 21, 25, 28, 654, 656-657, 678
DeGroot, M., ix, 91, 98, 181, 362, 536, 654, 678
DeMoivre, A., 8, 678
Diaconis, P., ix, 15, 28, 41, 46, 108, 123, 126, 426, 434, 479-480,
667, 678-680
Dickey, J., 24, 679, 681-682 
Doob, J., 36, 429, 507, 645-646, 679 
Doytchinov, B., ix 
Dubins, L., 70, 455, 676, 679 
Dugundji, J., 666, 679 
Dunford, N., 507, 635, 667, 679 
Dunsmore, I., 325, 675 

Eberhardt, K., 326, 679 
Eddy, W., 521, 681 
Edwards, W., 222, 284, 679 
Efron, B., 166, 330-331, 335-336, 423, 
679 

Escobar, M., 60, 679 

Fabius, J., 61, 679 
Fedderson, A., 99, 677 



Ferguson, T., 52, 56, 61, 173, 179, 181, 248, 258, 614, 666, 680

Ferrandiz, J., 669, 680 

Fieller, E., 321, 680 

Fienberg, S., 462, 676 

Fishburn, P., 181, 680 

Fisher, R., 89, 96, 217-218, 307, 370, 
373, 522, 680 

Fraser, D., 435, 677, 680 

Freedman, D., 15, 28, 40-41, 46, 61, 70, 123, 126, 330-331, 426,
434, 479-480, 667, 676, 678-680

Freedman, L., 24, 680 
Freeman, P., 524, 681 

Gabriel, K., 252, 681 
Garthwaite, P., 24, 681 
Gavasakar, U., 24, 681 
Geisser, S., 521, 668, 681 
Gelfand, A., 507, 681 
Geman, D., 507, 681 
Geman, S., 507, 681 
Ghosh, J., 381-382, 681 
Gnanadesikan, R., 22, 681 
Good, I., 565, 681 

Hadjicostas, P., ix 
Hall, P., 337-338, 681 
Hall, W., 381-382, 681 
Halmos, P., 364, 600, 681 
Hampel, F., 315, 681 
Hartigan, J., 20-21, 33, 681 
Heath, D., 21, 46, 681-682 
Hewitt, E., 46, 682 
Heyde, C., 435, 682
Hill, B., 9, 484, 682 
Hinkley, D., 218, 423, 678-679 
Hodges, J., 414 
Hoel, P., 640, 682 
Hogarth, R., 24, 682 
Holland, P., 462, 676 
Huber, P., 310, 315, 428, 682 
Hwang, J., 160, 677 

James, W., 163, 682 
Jaynes, E., 379, 682 
Jeffreys, H., 122, 229, 284, 682 
Jiang, T., ix 



Johnstone, I., 435, 682 

Kadane, J., 21, 24, 183-184, 446, 564, 655, 682-683, 687-688
Kagan, A., 349, 683 
Kahneman, D., 23, 683 
Kass, R., ix, 226, 446, 505, 683 
Kerridge, D., 564, 683 
Kiefer, J., 417, 420, 683 
Kinderman, A., 660, 683 
Kingman, J., 36, 683 
Knuth, D., x, 683 
Kraft, C., 66, 683
Krasker, W., 56, 683 
Krem, A., 408, 683 
Kshirsagar, A., 386, 684 
Kullback, S., 116, 684 

Lamport, L., x, 684 
Lane, D., 9, 682 
Lauritzen, S., 28, 123, 481, 684 
Lavine, M., 69, 526, 684 
LeCam, L., 414, 437, 684 
Lecoutre, B., 668-669, 684, 686 
Lehmann, E., 231, 280, 285, 298, 350, 
684 

Levy, P., 648, 650 

Lindley, D., 6, 229, 284, 479, 684 

Lindman, H., 222, 284, 679 

Linnik, Y., 349, 683 

Loeve, M., 34, 653, 685 

Louis, T., 24, 677 

Matts, J., 24, 677 
Mauldin, R., 66, 69, 685 
McDunnough, P., 435, 677, 680 
Mee, R., 326, 679 
Mendel, G., 217, 685 
Metivier, M., 66, 685
Monahan, J., 660, 683
Morgenstern, O., 181-182, 688
Morris, C., 166, 500, 679, 685

Nachbin, L., 364, 685 
Neyman, J., 89, 175, 231, 247, 420, 
685 

Nobile, A., ix 
Novick, M., ix, 6, 684 

Oue, S., ix



Patterson, R., 33, 688 

Pearson, E., 175, 231, 247, 685 

Pearson, K., 216, 685 

Perlman, M., 430, 685 

Peters, S., 24, 682 

Phillips, L., 6, 684 

Pierce, D., 99, 685 

Pitman, E., 347, 685 

Port, S., 640, 682 

Portnoy, S., x 

Pratt, J., 56, 98, 683, 685 

Raftery, A., 226, 683 

Ramamoorthi, R., 86, 676 

Rao, C., 152, 301, 349, 683, 685-686

Reeve, C., 326, 679

Regazzini, E., 21, 676 

Rigo, P., 21, 676 

Robbins, H., 303, 647, 677, 686 

Robert, C., 225, 686

Roberts, H., 565, 686 

Ronchetti, E., 315, 681 

Rouanet, H., 668-669, 684, 686 

Rousseeuw, P., 315, 681

Royden, H., 578, 589, 597, 621, 686 

Rubin, D., 332, 686 

Rudin, W., 666, 686 

Savage, L., 46, 181, 222, 284, 565, 
600, 679, 681-682, 686 

Scheffe, H., 298, 634, 684, 686 

Schervish, M., v 

Schwartz, J., 507, 635, 667, 679 

Schwartz, L., 429, 687 

Scott, E., 420, 685 

Seidenfeld, T., ix, 21, 183-184, 187, 
429, 564, 655, 682-683, 
686-687 

Sellke, T., 284, 676 

Serfling, R., 413, 687 

Sethuraman, J., 56, 687 

Short, T., ix 

Shurlow, N., v 

Siegmund, D., 647, 677 

Singh, K., 331, 687 

Slovic, P., 23, 683 



Smith, A., 479, 507, 681, 684, 687 
Smith, W., 24, 682 
Spiegelhalter, D., 24, 680 
Spjøtvoll, E., 283, 687
Stahel, W., 315, 681 
Steffey, D., 505, 683 
Stein, C., 163, 379, 382, 568, 682, 687
Stigler, S., 8, 687
Stone, C., 640, 682
Stone, M., 21, 678, 687 
Strasser, H., 430, 688 
Strawderman, W., ix, 166, 688 
Sudderth, W., 9, 21, 46, 66, 69, 
681-682, 685 

Taylor, R., 33, 688 

Tiao, G., 521, 677 

Tibshirani, R., 336, 679 

Tierney, L., 225, 446, 507, 683, 688 

Tversky, A., 23, 683 

Venn, J., 8, 688 

Verdinelli, I., 524, 688 

Villegas, C., 379, 677

Von Mises, R., 10, 688 

Von Neumann, J., 181-182, 688 

Wald, A., 415, 549, 552, 557, 688 

Walker, A., 435, 442, 688 

Wallace, D., 99, 688 

Wasserman, L., ix, 524, 526, 684, 688 

Welch, B., 320, 688 

West, M., 524, 688 

Wijsman, R., x, 381-382, 681 

Wilks, A., x, 676 

Wilks, S., 325, 688 

Williams, S., 66, 69, 685 

Winkler, R., 24, 682 

Wolfowitz, J., 417, 420, 557, 683, 688 

Wolpert, R., 526, 684 

Ylvisaker, D., 108, 679 
Young, G., 329, 688 

Zellner, A., 16, 688 
Zidek, J., 21, 678 



Subject Index* 



Abelian group, 353 

Absolutely continuous, 574, 597, 668 

Absolutely continuous function, 211 

Accept hypothesis, 214 

Acceptance-rejection, 659 

Action space, 144 

Admissible, 154-157, 162, 167, 174 

A, i5^-156, 162 
Almost everywhere, 572, 582 
Almost invariant function, 383 
Almost surely, 572, 582 
Alternate noncentral beta 
distribution, 668 
Alternate noncentral χ² distribution,
668 

Alternate noncentral F distribution, 
669 

Alternative, 2, 214 

composite, 215 

simple, 215, 233 
Analysis of variance, 384, 491 
Analytic function, 105 
Ancillary statistic, 95, 99, 119 

maximal, 97 
ANOVA, 384, 491 
Archimedean condition, 192
ARE, 413 

Asymptotic distribution, 399 
Asymptotic efficiency, 413
Asymptotic relative efficiency, 413
Asymptotic variance, 402 
Autoregression, 141 
Autoregressive process, 441 
Axioms of decision theory, 183-184, 
296 

Backward induction, 537 
Bahadur's theorem, 94 
Base measure, 54 
Base of test, 215-216 



* Italicized page numbers indicate 
where a term is defined. 



Basu's theorem, 99 

Bayes factor, 221, 238, 262-263, 274 

Bayes risk, 149 

Bayes rule, 150, 154-155, 167-168, 
178 

extended, 169 

formal, 146, 150, 157, 348, 351, 
369 

generalized, 156-157 

partial, 147, 150 
Bayes' theorem, 4, 16 
Bayesian bootstrap, 332 
Bernoulli distribution, 672 
Beta distribution, 54, 669 
Bhattacharyya lower bounds, 305 
Bias, 296 

Bimeasurable function, 572, 583, 618 
Binomial distribution, 673 
Bolzano-Weierstrass theorem, 666
Bootstrap, 329 

Bayesian, 332 

nonparametric, 329 

parametric, 330 
Borel σ-field, 571, 575
Borel space, 609, 618 
Borel-Cantelli lemma: 

first, 578 

second, 663 
Boundary, 636 

Boundedly complete statistic, 94, 99 
Box-Cox transformations, 521 

Called-off preference, 184 

Caratheodory extension theorem, 578

Cauchy distribution, 669 

Cauchy sequence, 619 

Cauchy's equation, 667 

Cauchy-Schwarz inequality, 615 

CDF, 612 

empirical, 404-405, 408
Central limit theorem, 642 

multivariate, 643 
Chain rule, 600 

Chapman-Robbins lower bound, 304 



Characteristic function, 611, 639 
Chi-squared distribution, 669 
Chi-squared test of independence, 467 
Closed set, 622 
Closure, 622 
Coherent tests, 252 
Complete class, 174 

essentially, 174, 244, 251, 256 
minimal, 174 

minimal, 174-175 
Complete class theorem, 179 
Complete measure space, 579, 603 
Complete metric space, 619 
Complete statistic, 94, 298

boundedly, 94, 99 
Composite alternative, 215 
Composite hypothesis, 215 
Conditional distribution, 13, 16, 607, 
609, 617 

regular, 610, 618 

version, 617 
Conditional expectation, 19, 607, 616 

version, 608, 616 
Conditional Fisher information, 111, 
119 

Conditional independence, 9, 610, 628 
Conditional Kullback-Leibler 

information, 115, 119 
Conditional mean, 607, 616 

version, 616 
Conditional preference, 185 

consistent, 186 
Conditional probability, 607, 609, 617 

regular, 609, 617 
Conditional score function, 111 
Conditionally sufficient statistic, 95 
Confidence coefficient, 315, 325 
Confidence interval, 3 

fixed-width, 559

sequential, 559 
Confidence sequence, 569 
Confidence set, 279, 315, 379 

conservative, 315 

exact, 315 

randomized, 316 

UMA, 317 

UMAU, 321 
Conjugate prior, 92 
Conservative confidence set, 315 



Conservative prediction set, 324 
Conservative tolerance set, 325 
Consistent, 397, 412 
Consistent conditional preference, 186 
Consistent distributions, 652 
Contingency table, 467 
Continuity axiom, 184 
Continuity theorem, 640 
Continuous distribution, 612 
Continuous mapping theorem, 638 
Convergence: 

pointwise, 184 

weak, 399, 635 
Convergence in distribution, 399, 611, 
635 

Convergence in probability, 396, 611, 
638 

Convex function, 614 
Counting measure, 570 
Covariance, 607, 613 
Cramer-Rao lower bound, 301 

multiparameter, 306 
Credible set, 327 

Cumulative distribution function (see CDF), 612
Cylinder set, 652 

Data, 82 

Decide optimally after stopping, 540 
Decision rule, 145 
maximum, 541 

nonrandomized, 145, 151, 153 

nonrandomized sequential, 537 

randomized, 145, 151 

randomized sequential, 537 

regular, 540-541 

sequential, 537 

nonrandomized, 537 
randomized, 537 

terminal, 537 

truncated, 542 
Decision theory, 144, 181 

axioms, 183-184 
Decreasing sequence of sets, 577 
DeFinetti's representation theorem, 
28 

Degenerate exponential family, 104 
Degenerate weak order, 183 
Delta method, 401, 464, 466 



Dense, 619 

Density, 607, 613 

Dirichlet distribution, 52, 54, 674 

Dirichlet process, 52, 54, 332, 434 

Discrete distribution, 612 

Distribution: 

alternate noncentral beta, 668 

alternate noncentral χ², 668

alternate noncentral F, 669 

asymptotic, 399 

Bernoulli, 672 

beta, 54, 669 

binomial, 673 

Cauchy, 669 

chi-squared, 669 

conditional, 13, 16 

consistent, 652 

continuous, 612 

Dirichlet, 52, 54, 674 

discrete, 612 

empirical, 12, 38 

exponential, 670 

F, 670 

fiducial, 370, 373 
gamma, 670 
geometric, 673 
half-normal, 389
hypergeometric, 673
inverse gamma, 670 
Laplace, 670 
least favorable, 168 
marginal, 14 
multinomial, 674 
multivariate normal, 643, 674 
negative binomial, 673 
noncentral beta, 289, 671 
noncentral χ², 671
noncentral F, 289, 671 
noncentral t, 289, 325, 671 
normal, 21, 349, 611, 640, 642, 
671 

multivariate, 643, 674 
Pareto, 672 
Poisson, 673 
posterior, 16 
predictive, 14 

posterior, 18 

prior, 14 
prior, 13 



improper, 20 
t, 672 

uniform, 659, 672 
Distribution function (see CDF), 612 
Dominance axiom, 185 
Dominated convergence theorem, 591 
Dominates, 154 
Dominating measure, 574, 597 
Dutch book, 656 

Efficiency: 

asymptotic, 413 

asymptotic relative, 413 

second-order, 414 
Elicitation of probabilities, 22-23
Empirical Bayes, 166, 420, 500 
Empirical CDF, 404-405, 408 
Empirical distribution, 12, 38 
Empirical probability measure, 12 
ε-contamination class, 524, 526, 528
Equal-tailed test, 263
Equivalence class, 140 
Equivalence relation, 140 
Equivariant rule, 357

location, 346-347, 351 

minimum risk (see MRE), 347 

scale, 350 
Essentially complete class, 174, 244, 
251, 256 

minimal, 174 
Estimator, 3, 296 

maximum likelihood, 3, 307 

MRE, 347, 351, 363 

Pitman, 347, 363 

point, 3, 296 

unbiased, 3, 296 
Event, 606, 612 
Exact confidence set, 315 
Exact prediction set, 324 
Exchangeable, 7, 27-28 

partially, 125, 479

row and column, 482
Expectation, 607, 613 

conditional, 19, 616 
Expected Fisher information, 423 
Expected loss principle, 146, 181 
Expected value (see Expectation), 
613 

Exponential distribution, 670 






Exponential family, 102-103, 105, 
109, 155, 239, 249 

degenerate, 104 

nondegenerate, 104 
Extended Bayes rule, 169 
Extended real numbers, 571 
Extremal family, 123, 125 

F distribution, 670 
Fatou's lemma, 589 
FI regularity conditions, 111 
Fiducial distribution, 370, 373 
Field, 571, 575 

Finite population sampling, 74 
Finitely additive probability, 21, 281, 
564, 657 

Fisher information, 111, 113, 301, 
412, 463 

conditional, 111, 119 

expected, 423

observed, 226, 424, 435 
Fisher-Neyman factorization 

theorem, 89 
Fixed point, 505 
Fixed-point problem, 505 
Fixed-width confidence interval, 559
Floor of test, 215-216
Formal Bayes rule, 146, 150, 157, 348, 

351, 369 
Fubini's theorem, 596 
Function: 

absolutely continuous, 211 

bimeasurable, 583 

measurable, 572, 583 

simple, 586 

Gamma distribution, 670 
General linear group, 354 
Generalized Bayes rule, 156-157, 159
Generalized Neyman-Pearson lemma, 
247 

Generated σ-field, 571-572, 584
Geometric distribution, 673 
Gibbs sampling, 507 
Goodness of fit test, 218, 461 
Gross error sensitivity, 312 
Group, 353, 355-356 

abelian, 353 

general linear, 354 



location, 354 

location-scale, 354, 357, 368 
permutation, 355 
scale, 354 

Haar measure: 
left, 363 

related, 366 
right, 363 
related, 366 
Hahn decomposition theorem, 605 
Half-normal distribution, 389 
Hierarchical model, 166, 476 
Highest posterior density region (see 

HPD), 327 
Hilbert space, 507 

Hilbert-Schmidt-type operator, 507, 
667 

Horse lottery, 182 

Hotelling's T², 388

HPD region, 327, 329, 343 

Hypergeometric distribution, 673 

Hyperparameters, 477

Hypothesis, 2, 214 

composite, 215 

one-sided, 241

simple, 215, 233 
Hypothesis test, 2 

predictive, 219, 325 

randomized, 3 
Hypothesis-testing loss, 214 

Identity element of group, 353 
Ignorable statistic, 142 
IID, 2, 8, 611, 628 

conditionally, 9-10, 83, 611, 628 
Image sigma field, 584 
Importance sampling, 403, 661 
Improper prior, 20, 122, 223, 263 
Inadmissible, 154 
Increasing sequence of sets, 577 
Independence, 610, 628 

conditional, 9, 610, 628 
Indifferent, 183 
Induced measure, 575, 601 
Infinitely often, 578, 663 
Influence function, 311 
Information: 

Fisher, 111, 113, 463 






Kullback-Leibler, 115-116 
Integrable, 588 

uniformly, 592 
Integral, 573, 587-588 

over a set, 588 
Invariance of distributions, 355 
Invariant function, 357 

almost, 383 

location, 346 

maximal, 358 

scale, 350 
Invariant loss, 356 

location, 346 

scale, 350-351 
Invariant measure, 363 
Inverse function theorem, 666 
Inverse gamma distribution, 670 
Inverse of group element, 353 

Jacobian, 625 

James-Stein estimator, 163, 486 
Jeffreys' prior, 122, 446 
Jensen's inequality, 614 

Kolmogorov zero-one law, 631 
Kullback-Leibler divergence, 116 
Kullback-Leibler information, 
115-116
conditional, 115, 119 

Levy's theorem, 648, 650 
λ-admissible, 154-156, 162
Laplace approximation, 226, 446 
Laplace distribution, 670 
Large order, 394 

stochastic, 396 
Law of large numbers: 

strong, 34-36 

weak, 642 
Law of the unconscious statistician, 

607, 613 
Law of total probability, 632 
Least favorable distribution, 168 
Lebesgue measure, 571, 580 
Left Haar measure, 363 

related, 366 
Lehmann-Scheffe theorem, 298 
L-estimator, 410
Level of test, 215-216 



LHM, 363 

Likelihood function, 2, 13, 307 
Likelihood ratio test (see LR test), 
274 

Linear regression, 276, 321 
LMP test, 245, 265, 289 
LMPU test, 265, 292 
LMVUE, 300 

Locally minimum variance unbiased 

estimator, 300 
Locally most powerful test (see 

LMP), 245 
Location equivariant rule, 346 
Location estimation, 346 
Location group, 354 
Location invariant function, 346 
Location invariant loss, 346 
Location parameter, 344 
Location-scale group, 354 
Location-scale parameter, 345 
Look-ahead decision rule, 546 
Loss function, 144, 162, 189, 296 

convex, 349 

hypothesis-testing, 214 

squared-error, 146, 297 

0-1, 215 

0-1-c, 215, 218 
Lower boundary, 170, 179, 233-235, 
287 

LR test, 223, 273-274, 458-459

Marginal distribution, 14, 607 
Marginalization paradox, 21 
Markov chain, 15, 507, 650 
Markov chain Monte Carlo, 507 
Markov inequality, 614 
Martingale, 645-646 

reversed, 33, 649 
Martingale convergence theorem, 

648-649 
Maximal ancillary, 97 
Maximal invariant, 358 
Maximin strategy, 168 
Maximin value, 168 
Maximum likelihood estimator, 3, 

307, 415, 418-421 
Maximum modulus theorem, 667 
Maximum of decision rules, 541 
MC test, 230 






Mean, 607, 613 

conditional, 616 

trimmed, 314 
Measurable function, 572, 583 
Measure, 570, 572, 575, 577 

induced, 601 

Lebesgue, 571, 580 

product, 595 

σ-finite, 572, 578, 601

signed, 577, 597 
Measure space, 572, 577 
M-estimator, 313-314, 424-428, 434
Method of Laplace, 226, 446
Method of moments, 340 
Mill's ratio, 470 

Minimal complete class, 174-175
Minimal essentially complete class, 174

Minimal sufficient statistic, 92 
Minimax principle, 167, 189 
Minimax rule, 167-169
Minimax theorem, 172 
Minimax value, 168 
Minimum risk equivariant (see MRE), 
347 

MLE, 3, 307, 415, 418-421 
MLR, 239-244 

Monotone convergence theorem, 590 
Monotone likelihood ratio, 239-244 
Monotone sequence of sets, 577 
Most cautious test, 230 
Most powerful test, 230 
MP test, 230 
MRE, 347, 349, 351, 363 
Multinomial distribution, 674 
Multiparameter Cramer-Rao lower 

bound, 306 
Multivariate central limit theorem, 

643 

Multivariate normal distribution, 643, 
674 

Natural parameter, 103, 105 
Natural parameter space, 103, 105 
Natural sufficient statistic, 103 
Negative binomial distribution, 673 
Negative part, 573, 588 
Negative set, 598 
Neyman structure, 266 



Neyman-Pearson fundamental 

lemma, 175, 231 
NM-lottery, 182 

Noncentral beta distribution, 289, 
671 

Noncentral χ² distribution, 671
Noncentral F distribution, 289, 671 
Noncentral t distribution, 289, 325, 
671 

Nondegenerate exponential family, 
104 

Nondegenerate weak order, 183 
Nonnull states, 184 
Nonparametric, 52 
Nonparametric bootstrap, 329 
Nonrandomized decision rule, 145, 
151, 153 

Nonrandomized sequential decision 

rule, 557 
Normal distribution, 21, 349, 611, 

640, 642, 671 
multivariate, 643, 674 
Null states, 184 

Observed Fisher information, 226, 

424, 435 
One-sided hypothesis, 241
One-sided test, 239, 243 
Open set, 571

Operating characteristic, 215 
Orbit, 358 
Order statistics, 86 
Outliers, 521 

Parameter, 1, 6, 50-51, 82 

location, 344 

location-scale, 345 

natural, 103, 105 

scale, 345 
Parameter space, 1, 50, 82 

natural, 103, 105 
Parametric bootstrap, 330 
Parametric family, 1, 50, 102 
Parametric index, 33, 50 
Parametric models, 12 
Parametric Models, 49 
Pareto distribution, 672 
Partial Bayes rule, 147, 150 
Partially exchangeable, 125, 479 






Percentile-t bootstrap confidence 

interval, 336 
Permutations, 355 
π-λ theorem, 576
Pitman's estimator, 347, 363 
Pivotal, 316, 370, 373 
Point estimation, 296 
Point estimator, 296 
Pointwise convergence, 184 
Poisson distribution, 673 
Polish space, 619 
Polya tree distribution, 69 
Polya urn scheme, 9 
Portmanteau theorem, 636 
Positive part, 573, 588 
Positive set, 598 
Posterior distribution, 4, 16 

asymptotic normality, 435, 437, 
442-443 

consistency, 429-430 
Posterior predictive distribution, 18 
Posterior risk, 146, 150 
Power function, 2, 215, 240 
Power set, 571 
Prediction set, 324-325 

conservative, 324 

exact, 324 
Predictive distribution, 14, 455

posterior, 18 

prior, 14 

Predictive hypothesis test, 219, 325 

Preference, 182 

conditional, 185 
consistent, 186 

Prevision, 655 

Prior distribution, 4, 13 
improper, 20, 223, 263 
natural conjugate family, 92 

Prize, 181 

Probability, 572, 577 

empirical, 12 

random, 27 
Probability integral transform, 519, 
659 

Probability space, 572, 577, 606, 612 
Product measure, 595 
Product σ-field, 576
Product space, 576 
Pseudorandom numbers, 659 



Pure significance test, 217 
P-value, 279, 375, 380 

Quantile: 

sample, 404-405, 408 

Radon-Nikodym derivative, 575, 598 
Radon-Nikodym theorem, 597 
Random probability measure, 27 
Random quantity, 82, 606, 612 
Random variables, 606, 612 

exchangeable, 27 

IID, 8 

Randomized confidence set, 316 
Randomized decision rule, 145, 151 
Randomized sequential decision rule, 
537 

Randomized test, 3 
Rao-Blackwell theorem, 152 
Ratio of uniforms, 660 

Regression, 276, 321, 519 

Regular conditional distribution, 610, 
618 

Regular conditional probabilities, 

609, 617 
Regular decision rule, 540 
Reject hypothesis, 214 
Rejection region, 2 
Related LHM, 366 
Related RHM, 366 
Relative rate of convergence, 415, 470
Restriction of σ-field, 584
Reversed martingale, 649 
RHM, 363 

Right Haar measure, 363 

related, 366 
Risk function, 149-150, 153, 155, 167,

216, 233, 297-298 
Risk set, 170-172, 179, 233, 235, 287
Robustness, 310 

Bayesian, 524 
Row and column exchangeable, 482

Sample quantile, 404-405, 408

Sample space, 2, 82 

Scale equivariant rule, 350 

Scale estimation, 350 

Scale group, 354 

Scale invariant function, 350 






Scale invariant loss, 350-351 
Scale parameter, 345 
Scheffe's theorem, 634 
Score function, 111, 122, 302, 305 

conditional, 111 
Second-order efficiency, 414 
Sensitivity analysis, 524 
Separable space, 619 
Separating hyperplane theorem, 666 
Sequential decision rule, 537 
Sequential probability ratio test, 549 
Sequential test, 548 
Set estimation, 296 
Shrinkage estimator, 163 
σ-field, 575

Borel, 571, 575 

generated, 571-572, 584 

image, 584 

restriction, 584 

tail, 632 
σ-finite measure, 572, 578, 601
Signed measure, 577, 597, 605, 635 
Significance probability, 217, 228, 280 
Significance test, 217 
Simple alternative, 215 
Simple function, 586 
Simple hypothesis, 215 
Size of test, 2, 215-216 
Small order, 394 

stochastic, 396 
SPRT, 549 

Squared-error loss, 146, 297 
√n-consistent, 401
SSS, 507 

St. Petersburg paradox, 655 
State independence, 184, 205 
State-dependent utility, 205-206 
States of Nature, 181, 189, 205 
Statistic, 83 

ancillary, 95, 99, 119 

boundedly complete, 94 

complete, 94, 298 

sufficient, 84-85-86, 99, 103, 
150-151, 298 
Stein estimator (see James-Stein 

estimator), 163 
Stochastic large order, 396 
Stochastic small order, 396 
Stone-Weierstrass theorem, 666



Stopping time, 537, 548, 552, 554 
Strict preference, 183 
Strong law of large numbers, 34-36 
Strongly unimodal, 329 
Submartingale, 646 
Successive substitution, 505-506, 545
Successive substitution sampling, 507 
Sufficient statistic, 84-85-86, 99, 103, 
109, 150-151, 298 

conditionally, 95 

minimal, 92 

natural, 103 
Superefficiency, 414 
Supporting hyperplane theorem, 666 
Sure-thing principle, 184 

t distribution, 672 
Tail σ-field, 632
Tailfree process, 60 
Taylor's theorem, 665 
Tchebychev's inequality, 614 
Terminal decision rule, 537 
Test: 

goodness of fit, 218, 461 

one-sided, 239, 243 

two-sided, 256, 273 
Test function, 175, 215 
Theorem: 

Bahadur, 94 

Basu, 99 

Bayes, 4, 16 

Bhattacharyya lower bounds, 
305 

Bolzano-Weierstrass, 666
Caratheodory extension, 578 
Cauchy's equation, 667 
central limit, 642 

multivariate, 643 
chain rule, 600 

Chapman-Robbins bound, 304 
complete class, 179 
continuity, 640 
continuous mapping, 638 
Cramer-Rao lower bound, 301 
DeFinetti, 27-28 
dominated convergence, 591 
Fatou's lemma, 589 
Fisher-Neyman, 89 
Fubini, 596 






Hahn decomposition, 605 
inverse function, 666 
Kolmogorov zero-one law, 631 
Levy, 648, 650 
law of total probability, 632 
Lehmann-Scheffe, 298 
martingale convergence, 648-649 
maximum modulus, 667 
minimax, 172 

monotone convergence, 590 
multivariate central limit, 643 
Neyman-Pearson, 175, 231 

generalized, 247 
π-λ, 576

portmanteau, 636 
Radon-Nikodym, 597 
Rao-Blackwell, 152 
Scheffe, 634 

separating hyperplane, 666 
Stone-Weierstrass, 666 
strong law of large numbers, 36 
supporting hyperplane, 666 
Taylor, 665 
Tonelli, 595 
uniqueness, 645 
upcrossing, 647 

weak law of large numbers, 642 
Tolerance coefficient, 325 
Tolerance set, 219, 325 

conservative, 325 
Tonelli's theorem, 595 
Topological space, 571, 575 
Topology, 571 
Transformation, 354 
Transition kernel, 124 
Trimmed mean, 314 
Trivial σ-field, 571
Truncated decision rule, 542 
Two-sided alternative, 246 
Two-sided hypothesis, 246 
Two-sided test, 256, 273 
Type I error, 214 
Type II error, 214 

UMA confidence set, 317 
UMAU confidence set, 321 
UMC test, 230-231, 239, 244, 255, 
257 

UMCU test, 254-256 



UMP test, 230, 240, 243-244, 255, 
257 

UMPU test, 254-256 

UMPUAI test, 384 

UMVUE, 297-299 

Unbiased estimator, 3, 296-302 

Unbiased test, 254 

Uniform distribution, 659, 672 

Uniformly integrable, 592 

Uniformly minimum variance 

unbiased estimator (see 

UMVUE), 297 
Uniformly most accurate confidence 

set (see UMA), 317 
Uniformly most accurate unbiased 

confidence set (see UMAU),

321 

Uniformly most cautious test (see 

UMC), 230 
Uniformly most cautious unbiased 

test (see UMCU), 254 
Uniformly most powerful test (see 

UMP), 230 
Uniformly most powerful unbiased 

test (see UMPU), 254 
Uniqueness theorem, 645 
Upcrossing lemma, 647 
Upper semicontinuous, 417
USC, 417 

Utility function, 181, 188 

state-dependent, 205-206 

Variance, 607, 613 
Variance components, 484 
Variance stabilizing transformation, 
402 

Version of conditional distribution, 
617 

Version of conditional expectation, 

608, 616 
Version of conditional mean, 616 

Wald's lemma, 552 

Weak convergence, 399, 635 

Weak* convergence, 635 

Weak law of large numbers, 642, 664 

Weak order, 183, 216-217, 280 

degenerate, 183 

nondegenerate, 183 
Weak preference, 182 



Springer Series in Statistics 



(continued from p. ii) 

Pollard: Convergence of Stochastic Processes. 

Pratt/Gibbons: Concepts of Nonparametric Theory.

Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data. 

Reinsel: Elements of Multivariate Time Series Analysis. 

Reiss: A Course on Point Processes. 

Reiss: Approximate Distributions of Order Statistics: With Applications 

to Non-parametric Statistics. 
Rieder: Robust Asymptotic Statistics. 
Rosenbaum: Observational Studies. 
Ross: Nonlinear Estimation. 

Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition. 
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics. 

Seneta: Non-Negative Matrices and Markov Chains, 2nd edition. 
Shao/Tu: The Jackknife and Bootstrap. 

Siegmund: Sequential Analysis: Tests and Confidence Intervals. 
Simonoff: Smoothing Methods in Statistics. 
Small: The Statistical Theory of Shape. 

Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior 

Distributions and Likelihood Functions, 3rd edition. 
Tong: The Multivariate Normal Distribution. 

van der Vaart/Wellner: Weak Convergence and Empirical Processes: With 

Applications to Statistics. 
Vapnik: Estimation of Dependences Based on Empirical Data. 
Weerahandi: Exact Statistical Methods for Data Analysis. 
West/Harrison: Bayesian Forecasting and Dynamic Models.
Wolter: Introduction to Variance Estimation. 

Yaglom: Correlation Theory of Stationary and Related Random Functions I: 
Basic Results. 

Yaglom: Correlation Theory of Stationary and Related Random Functions II: 
Supplementary Notes and References. 



